Thursday, June 30, 2011

Statistical Arbitrage: Part I

Statistical arbitrage is a type of algorithmic trading technique that relies on hedging all exposure to risk factors in order to profit from small, mean-reverting, predictable movements in security/currency/commodity prices. Put differently, statistical arbitrage is a quantitative system of trading one instrument against smaller amounts of many others, which allows the macroeconomic variables that typically affect the price of the first instrument to be hedged by similar exposure, in the opposite direction, to the others.

To do statistical arbitrage, we use a multivariate regression equation, which assumes that the instantaneous rate of change in an instrument's price, divided by its price, is equal to its alpha times the change in time, plus the sum, over all i, of a sensitivity 'beta sub i' times the return of risk factor i, plus an error term.
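In symbols (the notation here is mine), with S_t the instrument's price, F_t^(i) the value of risk factor i, and X_t the residual:

\[
\frac{dS_t}{S_t} = \alpha\,dt + \sum_{i=1}^{n} \beta_i\,\frac{dF_t^{(i)}}{F_t^{(i)}} + dX_t
\]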
 
In some cases, if the alpha is low, it may be safely ignored, and the value of the instrument then depends exclusively on the betas, the risk factors, and the cointegration residual. By trading, against the instrument, 'beta sub i' units of a portfolio replicating each risk factor i (such factor portfolios can be uncovered by principal components analysis), we get a complete portfolio whose value is determined only by the cointegration residual of the single instrument and its alpha, which may or may not be zero. (If the cointegration residual has a nonzero drift, that drift is our alpha.)
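To make that hedging step concrete, here is a minimal sketch in Python (my own illustration, not a production recipe; it assumes a T-by-N matrix of daily returns, with synthetic data standing in for a real feed): principal components analysis yields the factor portfolios, a regression of one instrument on the factor returns yields alpha and the betas, and what is left over is the cointegration residual.

```python
import numpy as np

# A minimal sketch: synthetic data stands in for real prices.
rng = np.random.default_rng(0)
T, N, n_factors = 500, 20, 3
returns = rng.normal(0.0, 0.01, size=(T, N))   # T days, N instruments

# PCA on the correlation matrix of returns: each eigenvector gives
# the weights of one risk-factor portfolio.
z = (returns - returns.mean(axis=0)) / returns.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]              # largest eigenvalues first
weights = eigvecs[:, order[:n_factors]]        # (N x k) factor weights

# Returns of the k factor portfolios.
factor_returns = z @ weights                   # (T x k)

# Regress one instrument (column 0) on the factors to estimate alpha
# and the betas; the leftover is the cointegration residual.
y = returns[:, 0]
X = np.column_stack([np.ones(T), factor_returns])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, betas = coef[0], coef[1:]

residual_returns = y - X @ coef                # unhedged remainder
residual = np.cumsum(residual_returns)         # integrated residual X_t
print("alpha:", alpha, "betas:", betas)
```

Hedging the instrument with beta_i units of each factor portfolio leaves this residual series (plus any alpha drift) as the position's value.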

This is an extremely simple procedure. We now have a portfolio whose value is a synthetic time series that we must model, and that is where approaches diverge. The Ornstein-Uhlenbeck process is the simplest, most robust, and most widely used model for the cointegration residual, but it may not be the best, since it makes many assumptions about the time series without any data to support them. In particular, an Ornstein-Uhlenbeck process assumes that the synthetic time series is stationary, that it has a Gaussian distribution with a fixed mean, and that it reverts to that mean at a speed that is a constant, linear function of its distance from the mean.

Each assumption can fail. A permanent change in the alpha makes the cointegration residual non-stationary until the change is detected and the new drift is regrouped into alpha; hence the need to regularly re-run the regression to separate out this drift, while never having a large enough sample to run that regression accurately. If the distribution is non-Gaussian, it would be more profitable to capitalize on that fact than to theorize it away. And the reversion speed may not actually be proportional to the distance from the mean: fewer participants may be willing to risk doing stat arb in an Extremistan (and not just out-of-equilibrium) environment, while more may be willing to trade when the residual is small, pushing it too far out of equilibrium on the opposite side. If, to avoid this, market participants stop holding the instrument once it gets past a certain point--say a certain z-score--we would see instruments bounce back (at least occasionally) when they get too close to equilibrium.
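Whatever those flaws, the baseline is worth seeing concretely. The Ornstein-Uhlenbeck process is

\[
dX_t = \kappa\,(m - X_t)\,dt + \sigma\,dW_t,
\]

and sampling it at fixed intervals gives an AR(1), so it can be calibrated with a one-variable linear regression. A minimal sketch (my own illustration; a simulated path stands in for the real residual series):

```python
import numpy as np

def fit_ou(x, dt=1.0):
    """Calibrate dX = kappa*(m - X) dt + sigma dW by regressing
    X[t+1] on X[t], the exact AR(1) discretization of the process."""
    x0, x1 = x[:-1], x[1:]
    b, a = np.polyfit(x0, x1, 1)                    # x1 ~ a + b*x0
    resid = x1 - (a + b * x0)
    kappa = -np.log(b) / dt                         # mean-reversion speed
    m = a / (1.0 - b)                               # long-run mean
    sigma_eq = np.sqrt(resid.var() / (1.0 - b**2))  # stationary std dev
    return kappa, m, sigma_eq

# Simulate an OU path in place of a real cointegration residual.
rng = np.random.default_rng(1)
kappa_true, m_true, sigma, dt, T = 5.0, 0.0, 0.3, 1.0 / 252, 2000
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = (x[t] + kappa_true * (m_true - x[t]) * dt
                + sigma * np.sqrt(dt) * rng.normal())

kappa, m, sigma_eq = fit_ou(x, dt)
print(f"kappa={kappa:.2f}, m={m:.4f}, sigma_eq={sigma_eq:.4f}")
# The z-score (x - m) / sigma_eq is the usual entry/exit signal.
```

Each criticism above is a statement about where this fit goes wrong: a drifting m, non-Gaussian residuals, or a kappa that is not constant.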

Of course I don't know exactly what would happen or how, but Bayesian networks hold more promise in this regard than their rivals, linear stochastic differential equations. Nonlinear Fokker-Planck equations seem to have some potential for replacing black-box predictors like artificial neural networks or hidden Markov models, but the issue is that there is no good way to derive the 'correct' formula. Either way, we must make assumptions about the structure of such an equation before we can try to calibrate it.
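For reference, the linear one-dimensional Fokker-Planck equation for the density p(x, t) of a diffusion with drift mu(x) and diffusion coefficient D(x) is

\[
\frac{\partial p}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[\mu(x)\,p\bigr]
  + \frac{\partial^2}{\partial x^2}\bigl[D(x)\,p\bigr],
\]

and the nonlinear variants let mu or D depend on p itself. Choosing the form of that dependence is precisely the structural assumption that has to be made before any calibration.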

That all being said, and knowing I'm a fan of hidden Markov models, I would recommend them most highly as the basis for a stat-arb prediction engine.
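As an illustration of the shape such an engine might take (a sketch, not a recommendation of any particular system; it assumes the third-party hmmlearn library, and synthetic two-regime data stands in for real residual changes):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # third-party: pip install hmmlearn

# Placeholder observations: one-step changes of the cointegration
# residual, faked here as two volatility regimes stitched together.
rng = np.random.default_rng(2)
calm = rng.normal(0.0, 0.005, size=500)
wild = rng.normal(0.0, 0.020, size=500)
obs = np.concatenate([calm, wild]).reshape(-1, 1)

# Two hidden regimes, e.g. "quiet" vs. "dislocated".
model = GaussianHMM(n_components=2, covariance_type="full", n_iter=100)
model.fit(obs)

states = model.predict(obs)            # most likely regime at each step
print("regime std devs:", np.sqrt(model.covars_.ravel()))
print("fraction of time in regime 1:", states.mean())
```

The fitted state sequence can then gate the entry and exit rules: trade the reversion only in the regime where it actually holds.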
