Terms

This blog is for educational and informational purposes only. The contents of this blog are opinions of the author and should not be interpreted as investment advice. The author takes no responsibility for the investment decisions of other individuals or institutions, and is not liable for any losses they may incur. By reading this blog, you are agreeing that you understand and agree to the terms above.

Saturday, August 25, 2012

Predictive Models, Meta-Optimization, and the Importance of Parameters

Most of my favorite statistical models are parametric, so when I seek to apply them to financial markets, I need some way to train them until they learn the structure of the data in question.

And yet some of my favorite optimization algorithms are also parametric. The predictive ability of a parametric model obviously depends on how properly it learns from the data. And what defines 'properly'? Well, unless we're changing the structure of the model or of the optimization algorithm--both of which warrant whole other discussions--we want to find optimal model parameters: ones that reflect the data. But undertaking that search with a search algorithm whose own parameters are left untuned can be (hyperbole, perhaps) almost as bad as not optimizing the model parameters at all. So what we need is to see which means of learning--which optimization parameters--give the model the best ability to learn. Doing that requires searching yet another parameter space. Sound familiar?

So we do, even though it seems as though we have only postponed the problem. How can we search the parameter space of an algorithm if we couldn't search the parameter space of the model? Well, we could apply the same algorithm (with some unadventurous default parameters of its own) to search the algorithm's parameter space. In other words, it would move through that space, settle on some spot, use the algorithm with those parameters to train the model, and then measure the model's performance. And if we so chose, we could try to optimize the parameters of the algorithm that is searching the algorithm's parameter space. But that is beginning to sound ridiculous.

Another approach is to use a search algorithm with fewer parameters, or none at all. But this may mean settling for a way of learning that isn't as good as it could be, which means we have not properly learned how to learn--our approach was too rigid. Ultimately, this becomes a judgement call.

I personally like training my hidden Markov models with particle swarm optimization, but this is computationally intensive, and it is unclear what the social weight, cognitive weight, and inertia should be for the swarm. Applying PSO to the PSO parameter space has worked nicely. So has a cuckoo search.
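
To make the idea concrete, here is a minimal sketch of the meta-optimization loop in Python. The inner objective (the Rastrigin function) is just a stand-in for something like "the negative log-likelihood of an HMM trained with the candidate PSO parameters," and every number here is an arbitrary choice for illustration, not a description of my actual setup.

```python
import numpy as np

def pso(objective, dim, bounds, n_particles=15, iters=30,
        w=0.7, c1=1.5, c2=1.5, rng=None):
    """A bare-bones particle swarm optimizer (minimization)."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest, gbest_val = pbest[pbest_val.argmin()].copy(), pbest_val.min()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        if vals.min() < gbest_val:
            gbest, gbest_val = pos[vals.argmin()].copy(), vals.min()
    return gbest, gbest_val

# Toy stand-in for "how well did the model train?"  In a real setting this
# would be something like the negative log-likelihood of an HMM whose
# parameters were fit by the inner PSO run.
def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def meta_objective(params):
    w, c1, c2 = params
    _, best = pso(rastrigin, dim=5, bounds=(-5.12, 5.12), w=w, c1=c1, c2=c2, rng=0)
    return best

# The outer PSO, with unadventurous defaults, searches the (w, c1, c2) space.
best_params, _ = pso(meta_objective, dim=3, bounds=(0.0, 2.5),
                     n_particles=10, iters=20, rng=1)
print("meta-optimized PSO parameters (w, c1, c2):", np.round(best_params, 3))
```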

Wednesday, December 28, 2011

Multistrategy and Multisignal Portfolio Construction

The best portfolios are, no doubt, ones that earn the highest returns with the least volatility. After ascertaining that you possess several informative signals for the same group of securities, you may be wondering how to capitalize on these signals as a whole. If the signals are high-quality enough to predict specific returns, rather than just being the simple "buy/sell" type, then you can take one of two possible approaches:

1) If you are deciding how to allocate capital to different strategies, run backtests to determine the historical average return and standard deviation of each strategy, as well as the historical correlations between strategies, and then use mean-variance portfolio optimization to find the Sharpe-optimal weighting of each strategy in a multistrategy fund (see the sketch after this list). While this is great for finished strategies built on completely different structures (for instance, typically unrelated strategies like stat arb and global macro), it unfortunately does not get at the heart of the problem: if there are no strategies, only signals, how do we construct a multisignal portfolio at the level of individual securities?

2) Use historical price data to generate signals, and use all signals from a given day in the past to run a multivariate linear regression and determine the sensitivity of the output (next-period returns) to each input signal. Alternatively, we can train an artificial neural network to do something similar, and pray it doesn't overfit the data. Lastly, we can set up a machine learning system that uses fuzzy logic to 'reason' its way through investment decisions and portfolio construction, analyzing historical patterns of how certain signals contradict the others when something bad is about to happen. In such an event, the system can work out a more informed expectation of returns, which can then be fed into a portfolio optimizer.
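
As an illustration of the first approach, here is a quick sketch of Sharpe-optimal (tangency) weighting computed from backtested strategy returns; the return series below are random placeholders standing in for real backtest output, and a zero risk-free rate is assumed.

```python
import numpy as np

# Placeholder backtest returns: rows are days, columns are strategies.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(750, 3))

mu = returns.mean(axis=0)            # historical average return per strategy
cov = np.cov(returns, rowvar=False)  # historical covariance between strategies

# Tangency (maximum-Sharpe) weights: w proportional to inv(cov) @ mu,
# rescaled so the weights sum to one.
raw = np.linalg.solve(cov, mu)
weights = raw / raw.sum()
print("Sharpe-optimal strategy weights:", np.round(weights, 3))
```

In practice one would add constraints (long-only, leverage caps, turnover limits) before acting on weights like these.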

Thursday, June 30, 2011

Statistical Arbitrage: Part I

Statistical arbitrage is a type of algorithmic trading technique that relies on hedging all exposure to risk factors in order to profit from small, mean-reverting, predictable movements in security/currency/commodity prices. Put differently, statistical arbitrage is a quantitative system of trading one instrument against smaller amounts of many others, which allows the macroeconomic variables that typically affect the price of the first instrument to be hedged by similar exposure, in the opposite direction, through the others.

To do statistical arbitrage, we use a multivariate regression equation, which assumes that the instantaneous rate of change in an instrument's price, divided by its price, is equal to its alpha times the change in time, plus the sum over all i of a sensitivity 'beta sub i' times the current value of factor i, plus an error term.
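
Written out (this is just my rendering of the sentence above, with P_t the instrument's price, F_{i,t} the value of risk factor i, and X_t the error/residual term that the next paragraphs call the cointegration residual; in many treatments the factor term is written instead as the factor's return, beta_i dF_{i,t}/F_{i,t}):

```latex
\frac{dP_t}{P_t} \;=\; \alpha\,dt \;+\; \sum_{i=1}^{n} \beta_i\,F_{i,t} \;+\; dX_t
```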
 
In some cases, if the alpha is low, it may safely be ignored, and the value of the instrument then depends exclusively on the betas, the risk factors, and the cointegration residual. By trading 'beta sub i' units of a portfolio replicating risk factor i against the instrument (such replicating portfolios can be revealed by principal components analysis), for each risk factor i, we get a complete portfolio whose value is determined only by the cointegration residual of the single instrument and by its alpha, which may or may not be zero. (If the cointegration residual has a nonzero drift, that drift is our alpha.)
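
To make the construction concrete, here is a rough sketch of extracting PCA risk-factor portfolios from a universe of returns, estimating one instrument's betas to them by regression, and backing out the residual. The data, the five-factor choice, and the use of scikit-learn are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
universe = rng.normal(0, 0.01, size=(500, 50))   # placeholder daily returns, 50 names
target = universe[:, 0]                          # the single instrument we want to trade

# Risk-factor "portfolios" revealed by principal components analysis.
pca = PCA(n_components=5)
factor_returns = pca.fit_transform(universe)     # daily returns of the 5 factor portfolios

# Estimate alpha and the beta_i by multivariate regression.
X = np.column_stack([np.ones(len(target)), factor_returns])
coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
alpha, betas = coefs[0], coefs[1:]

# Hedging beta_i of each factor leaves the residual increments; their running
# sum is the residual level that gets modeled in the next step.
residual_increments = target - alpha - factor_returns @ betas
residual_level = np.cumsum(residual_increments)
print("alpha:", alpha, "first betas:", np.round(betas[:3], 3))
```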

This is an extremely simple procedure. We now have a portfolio whose value is a synthetic time series that we must model, and that is where approaches diverge. Ornstein-Uhlenbeck processes are the simplest, most robust, and most widely used model for describing the cointegration residual, but they may not be the best, since they make many assumptions about the time series without any data to support them. For instance, an Ornstein-Uhlenbeck process assumes that the synthetic time series is stationary, has a Gaussian distribution with a fixed mean, and reverts to that mean at a speed that is a constant, linear function of the distance from the mean.

Each of these assumptions can cause trouble. A permanent change in the alpha means the cointegration residual becomes non-stationary until the change is detected and the new drift is grouped into alpha (hence the need to regularly re-run the regression to separate out this drift, while never having a large sample over which to run it accurately). Next, if the distribution is non-Gaussian, it would be more profitable to capitalize on that fact than to theorize it away. Lastly, the reversion speed may be better described by something other than distance from the mean. It may not be proportional to that distance at all if fewer participants are willing to risk doing stat arb in an Extremistan (and not merely out-of-equilibrium) environment, while more are willing to be stat arbitrageurs when the residual is small, pushing it too far out of equilibrium on the opposite side. And if, to avoid this, market participants stop holding the instrument once the residual gets past a certain point--say a certain z-score--we would see instruments bounce back (at least occasionally) as they get close to equilibrium.
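
For what it's worth, calibrating an Ornstein-Uhlenbeck process to a sampled residual is usually done by treating the series as an AR(1) and mapping the regression coefficients back to the OU parameters; a minimal sketch (with a simulated residual standing in for a real one) follows.

```python
import numpy as np

def fit_ou(x, dt=1.0):
    """Fit dX = kappa*(m - X) dt + sigma*dW by regressing X_{t+1} on X_t (AR(1))."""
    x0, x1 = x[:-1], x[1:]
    A = np.column_stack([np.ones_like(x0), x0])
    (a, b), *_ = np.linalg.lstsq(A, x1, rcond=None)
    resid = x1 - (a + b * x0)
    kappa = -np.log(b) / dt                      # mean-reversion speed
    m = a / (1.0 - b)                            # long-run mean
    sigma = resid.std(ddof=2) * np.sqrt(2 * kappa / (1.0 - b**2))
    return kappa, m, sigma

# A simulated residual stands in for a real cointegration residual here.
rng = np.random.default_rng(0)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = x[t - 1] + 0.05 * (0.0 - x[t - 1]) + 0.02 * rng.standard_normal()

kappa, m, sigma = fit_ou(x)
print(f"kappa={kappa:.3f}, mean={m:.4f}, sigma={sigma:.4f}")
```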

Of course I don't know what would happen and how it would happen, but Bayesian Networks have more promise in this regard than their rivals, linear stochastic differential equations. Nonlinear Fokker-Planck Equations seem to have some potential for replacing black box predictions like those of Artificial Neural Networks or Hidden Markov Models, but the issue is that there is no great way to derive the 'correct' formula. Either way, we must make assumptions as to the structure of such an equation before we can try to calibrate it.

That all being said, and knowing I'm a fan of hidden Markov models, I would recommend them most highly for a stat arb prediction engine.

Tuesday, June 28, 2011

Hidden Markov Models: Part II

Since hidden Markov models help researchers find what sorts of observations tend to come after other types of observations, we can forecast a stock's behavior once we know what hidden state the security/commodity/currency/synthetic time series was most recently in. (By a synthetic time series, I mean some type of hedged position with a single value being bet on: cointegration residuals in statistical arbitrage, implied volatilities after delta hedging, or PCA-derived risk factor values.)

To successfully deploy capital using an HMM-driven technique, one needs to avoid overfitting the model to the data. I struggled with this. Daily closing prices turned out to be too unpredictable to work with, because the more time passed, the less the older patterns mattered; consequently, I was forced to limit the size of the training sequence so that the hidden Markov model would only bother learning from the relevant data. Unfortunately, a training set of only 40 days is too small to work well, 50 days is pushing it, and anything longer is quite outdated for any daily trading model.

The way around the problem was to use higher-frequency data, because it would stay relevant while still providing a wealth of information and hidden patterns. Beyond that, higher-frequency data allows more predictions to be made in any given day, and thus limits volatility in portfolio returns. (Real-time prices are available through professional brokerages or through subscriptions to specific Reuters or Yahoo Finance services; delayed but regularly updated prices are available on Google Finance, with no subscription necessary.)
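
A minimal sketch of that workflow, using the hmmlearn package as one possible implementation, synthetic returns in place of a real intraday feed, and an arbitrary 2,000-observation training window:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Placeholder for higher-frequency returns; in practice these might be
# five-minute returns pulled from whatever feed you have access to.
rng = np.random.default_rng(0)
returns = rng.normal(0, 0.001, size=5000)

# Keep only a recent window so stale patterns don't dominate the training.
window = returns[-2000:].reshape(-1, 1)

model = GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
model.fit(window)

states = model.predict(window)
print("most recent hidden state:", states[-1])
```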

In order for a hidden Markov model--or any statistical strategy--to work, the trading technique must be used many, many times. As the number of times the strategy is used increases, the variability in the strategy's overall success decreases, and the strategy has more potential for a clean statistical edge to shine through. Conversely, if only a few instruments are held as a portfolio, the portfolio's return is less certain. Trading a few instruments with a prediction algorithm is like going spearfishing with a toothpick. It really is that impractical.
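
That point is just the law of large numbers at work; the quick simulation below (with a made-up per-trade edge and volatility) shows how the spread of realized average returns narrows as the trade count grows.

```python
import numpy as np

rng = np.random.default_rng(0)
edge, noise = 0.0002, 0.01   # hypothetical per-trade mean return and volatility

for n_trades in (10, 100, 1000, 10000):
    # 1,000 simulated "runs" of the strategy, each making n_trades independent bets.
    avg_returns = rng.normal(edge, noise, size=(1000, n_trades)).mean(axis=1)
    print(f"{n_trades:>6} trades -> std of realized average return: {avg_returns.std():.6f}")
```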

Also, if you know how to do something with a synthetic time series, do it. There tends to be much less variability in outcomes when unwanted risk factors are hedged, and thus much less uncertainty regarding the hidden Markov model's predictive ability.

Sunday, June 5, 2011

Hidden Markov Models

Hidden Markov models have been proven successful for speech recognition, and their success carries over to the prediction of financial time series. According to Patterson's The Quants, and Mallaby's More Money Than God, Renaissance Technologies owes a great deal of their success to hidden Markov models. Research by academics in this paper and this other paper further validated the financial utility of hidden Markov models, and papers such as this one demonstrated their superiority over GARCH(1, 1) models for accurate volatility modeling.

The major issue with using hidden Markov models to predict financial time series is that we are trying to forecast the inherently chaotic. Put differently, forcing HMMs to learn from raw financial data is not always the best idea, because it forces them to try to predict the outcome of Brownian motion. On the other hand, that's what information theory is supposed to be about--detecting and predicting signals sent through a noisy channel. So while HMMs can certainly still be used on raw financial data, it's a bit much to ask. The one glaring exception is high-frequency data, which contains many more observations and hence is more likely to contain some pattern that daily or longer-term data does not reveal. So if dealing with daily or longer-term data, it's a lot easier to do something that eliminates market noise and results in a more statistically calm, pattern-containing time series. Some such approaches for hidden Markov models include statistical arbitrage, volatility arbitrage, correlation forecasting, and volume prediction.

Of course, HMMs can also be used even less directly; for instance, by doing information extraction--getting pure information from humans' news articles, such as those on Reuters.com. (My next post will discuss this briefly, and the one after that will talk about other information extraction methods.)

But the most fruitful, direct application of HMMs is in high-frequency trading. Because they inherently sort returns into groups (with observations of these returns corresponding to certain probability distributions) that are the underlying 'states,' hidden Markov models can separate out statistically different price movements the same way a two-state model can distinguish between vowels and consonants. Put differently, the way the underlying states fit together means that even if observed returns are uncorrelated across time, they may be related in a more subtle way: one particular type of return may be followed by another particular type of return more often than by returns not belonging to that type. I'll leave the rest up to the reader's imagination and programming skills.
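
Continuing the earlier sketch: once a Gaussian HMM has been fit (here again with hmmlearn and synthetic data), its transition matrix and state means give a simple one-step-ahead expected return. This is only one naive way to turn the fitted model into a forecast.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(1)
returns = rng.normal(0, 0.001, size=(2000, 1))   # placeholder high-frequency returns

model = GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
model.fit(returns)

# Most likely current state, given everything observed so far.
last_state = model.predict(returns)[-1]

# Distribution over the next state, and the implied expected next return.
next_state_probs = model.transmat_[last_state]
expected_next_return = float(next_state_probs @ model.means_.ravel())
print("expected next-period return:", expected_next_return)
```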

Wednesday, June 1, 2011

A New Kind of Global Macro

This is going to be kind of esoteric because I'm holding my cards close to the vest.

Imagine:
Since the entire basis of global macro is risk management, what if we could put portfolios together one risk factor at a time? Wouldn't that be interesting? Risk exposure is the beginning and the end of all global macro strategy, so why wait until the end to integrate it? It would be much better to bet on the risk factors right off the bat. Why focus on certain commodities, currencies, and corporations' securities (or the respective derivatives)? That was never the point of global macro. We care about capitalizing on macroeconomic changes anyway. Why get exposure to those changes through a few instruments when you could go right to the source and just buy the factor itself?

It's simpler. It's more efficient. It's more potent. It's better diversified.

So the answer is to construct portfolios consisting of nothing but a few uncorrelated risk factors, selected in whatever relative proportions are desired. The method driving the prediction of risk factors is not the issue; that is for another time. What we're concerned with is portfolio allocation once the decisions have been made.

And best of all? Portfolio optimization is easy. The risk factors are approximated by baskets of many instruments, and those instruments each have a covariance with one another--but that has already been addressed by principal components analysis. We can simply treat each factor as an instrument to be bought. (This is reasonable because it is like buying a stock, which is a basket of risk factors, only these have non-zero sensitivity to only one risk factor.) Once we have bought these baby, de facto 'stocks,' we realize that they have no correlation at all. Thus, the covariance terms inside the first sigma of the Markowitz mean-variance model are eliminated, and the second sum is easily maximized by investing the most money where the highest return is expected. The effect is similar to setting our risk-aversion parameter to zero (though we have done nothing of the kind), since the problem is now trivial.
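
A toy sketch of the allocation step: because PCA factor returns are uncorrelated by construction, the covariance matrix is (approximately) diagonal and the mean-variance weights decouple into a per-factor calculation. The expected factor returns below are placeholders for whatever prediction method one prefers.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
universe = rng.normal(0, 0.01, size=(500, 40))   # placeholder security returns

pca = PCA(n_components=5)
factor_returns = pca.fit_transform(universe)     # uncorrelated factor "instruments"

expected = rng.normal(0.0005, 0.0002, size=5)    # stand-in forecasts for each factor
variances = factor_returns.var(axis=0)           # off-diagonal covariances are ~0

# With a (near-)diagonal covariance matrix the mean-variance weights decouple:
# each factor's weight is proportional to its expected return over its variance.
raw = expected / variances
weights = raw / np.abs(raw).sum()
print("factor weights:", np.round(weights, 3))
```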

This may seem dangerous, because the allocation weights are not constrained, but a rule can be added separately, allowing a maximum position size in any given risk factor, as well as a maximum position variance and/or maximum portfolio variance. On top of that, we can use VaR systems and stress testing like any global macro fund would.

The exact method for predicting factors varies considerably. There are several approaches that could work, including the John Paulson approach, the Soros approach, and the Robert Frey approach.

Tuesday, May 31, 2011

The Data-Driven Manifesto

First and foremost, timing is everything. And not just in humor.

The correct timing is the sole requirement for profitability. Whether it be signals generated by a statistical model or an important event in the news, all information is incorporated very quickly into market prices. Thus, the only thing that matters is timing each trade just right, whether that means capturing imbalances in supply and demand at the microstructure level, parsing news articles with information extraction algorithms, or betting on a trend reversal after seeing a familiar signal. There is no such thing as a temporally irrelevant investment. Everything depends on time.

If data is old, one cannot use it to successfully invest. No fundamental data or technical data or obvious statistical data is necessary at all, because it is public knowledge that has already been incorporated into the price of the security, commodity, currency or derivative in question. Betting on a known fact is useless, because the market price already reflects that fact, and will not change due to the fact. How could it? The information has already been unveiled. It is not going to unveil itself again--it cannot become common knowledge twice!

However, if the data suggests a statistical anomaly in the presumably efficient market, then it may be proper to act on it, assuming that others are not aware of the anomaly. Widely observed statistical phenomena such as the value, size, and momentum factors are not good foundations for trading strategies precisely because they are already widely observed, which makes them crowded and therefore riskier; indeed, the more market participants using a strategy, the more potential that strategy has to underperform the overall market index.

The whole reason investment strategies are made market-neutral is that their creators did not want to worry about predicting where investors, as a flock, would go next. Unfortunately, any strategy with sufficient popularity suffers from the same problem: people invest in the strategy and then withdraw capital, creating volatility in the strategy's returns. This is true for everything from statistical arbitrage to the failure of quantitative equity selection models based on value and size. The strategies worked well only while few people used them, when their returns came from corrections rather than from flows. For instance, when stat arb was still new, success came in the form of lower returns for an outperforming stock. It wasn't that investors necessarily used stat arb and recognized that the value of the stock was too high; rather, the strategies' successes were phenomena unrelated to the direct investment decisions of any individual or institution. But once they became popular, the returns of such strategies were directly impacted by a greater portion of the market, which now knew of and traded the strategies directly. At that point, the strategies were inherently subject to the same fickleness that market indices have always been subject to.

Rather intuitively, the more participants using a strategy, the lower the returns associated with it, and the lower the Sharpe ratio. The standard deviation of an overused strategy's returns is also much higher, again for intuitive reasons: the number of market participants is positively correlated with the number of multi-strategy investors; as the number of strategies in use increases, the probability that at least one strategy fails increases as well--and eventually one of the many strategies is bound to fail; and as the number of participants increases, the number of investors most heavily exposed to the failing strategy increases too. As the number of investors most heavily exposed to the failing strategy approaches infinity, the probability that one of them is heavily leveraged and consequently receives a large margin call (as a result of the failing strategy) converges to unity. From that point, the participant with the margin call liquidates an arbitrary number of its strategies' portfolios. If these include, say, a value/momentum book, then the value/momentum strategy will suffer from the unwinding of the participant's portfolio.

The problem, then, is not market crashes or strategy failure per se, but rather the spillover effect onto other portfolios using the same over-popular strategy. The solution is obvious: use strategies that other participants have (literally) never even heard of. The mere knowledge of a possible strategy may encourage participants to covertly experiment with it, and perhaps put it into practice without declaring it. Luckily, as more participants use a strategy, its Sharpe ratio declines, and vigilant observation will allow the original users to leave quietly before it fails in dramatic fashion.

* * * * *

The problem with theory-driven strategies is that they usually reject temporal trading rules. For instance, CAPM, EMH, APT and MPT in general fail to account for the possibility of different expected returns across time. They cannot adjust to changing market conditions either, and their models often make too many restrictive assumptions.

The fullest implication of "data-driven" strategies is that their associated models are not merely created ahead of time and parametrically synchronized with data, but rather that the data itself determines the model's structure, and not just its parameters.

For instance, the number of hidden layers in an artificial neural network, the number of iterations of a genetic algorithm for portfolio selection, and the number of states in a hidden Markov model are all input by a human programmer. And yet these human decisions are the ones that constrain the model and keep it from becoming fully formed. Thus, these decisions, too, need to be supported by the data.

The optimization method used to find the most accurate model (as measured by the probability of the model producing the training sequence) should be one that leads to global optimality. Hill-climbing algorithms can only guarantee local optimality and are therefore less desirable than algorithms that search for global maxima. This is intuitive, since the more accurate the model, the better it represents the truth behind how the market moves and works.
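
As a small illustration of why local search alone is risky, here is a toy comparison of a single hill climb against random restarts on a multimodal function; proper global methods (PSO, cuckoo search, simulated annealing, and so on) address the same weakness more systematically.

```python
import numpy as np

def multimodal(x):
    # Lots of local minima; the global minimum sits at x = 0.
    return x**2 + 10 * np.sin(3 * x)**2

def hill_climb(x, step=0.05, iters=500):
    best = multimodal(x)
    for _ in range(iters):
        for cand in (x - step, x + step):
            val = multimodal(cand)
            if val < best:
                x, best = cand, val
    return x, best

rng = np.random.default_rng(0)

# A single hill climb from one arbitrary start can stall in a local minimum.
print("single start:       ", hill_climb(5.0))

# A crude step toward global optimality: restart from many points, keep the best.
restarts = [hill_climb(x0) for x0 in rng.uniform(-10, 10, size=30)]
print("best of 30 restarts:", min(restarts, key=lambda r: r[1]))
```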

Once the globally optimal model's structure is perfected through historical profitability tests, the model is ready to use, and no more human intervention is necessary. However, as a scientific experiment, it would be interesting to see what kind of unexpected connections, classifications, and procedures the models could come up with. It will be complex, and likely counterintuitive.

"Black box" has become finance-lingo for any algorithmic trading strategy without a simple, logical backing. The models' structures and parameters--or anything too complex or too counterintuitive for a human to understand--are labeled as "black-box" as if that was a bad thing. The fact that the strategy is obscure helps to avoid the crowding effect that leads to the downfall of every hyped investment strategy, from LTCM's fixed income arbitrage disaster to PDT's temporary Stat Arb troubles in August and November 2007. The more black-box it is, the less others are likely to catch on, and the better the strategy's performance will be. Put differently, assuming that it is sound, it won't be subject to failure on account of sheer popularity. From a bigger perspective, all data-driven strategies work well--assuming that they are not too well known, in which case their success is nothing but a house of cards that has been lucky not to suffer from a gust of wind--because they exploit market phenomena that do exist, rather than ones that ought to exist. Indeed, data-driven strategies are valid because they have been validated by the market.