Australian (ASX) Stock Market Forum

Getting Started in Machine Learning for Trading

This message could have been posted as a response to a question -- "How to get started with Python and machine learning?" -- asked in another thread.

I think a separate thread devoted to tools and techniques for machine learning will be valuable. Here goes.

====================

Here is a link to the bibliography that is an appendix to the "Foundations" book.
http://www.blueowlpress.com/wp-content/uploads/2016/08/FT-Bibliography-Appendix-D.pdf

There are two areas you will need to study.
1. Python.
2. Machine Learning.

-------------------

Python is your base language. Unless you already have substantial experience with, and support for, R, look no further. If you are uncertain and trying to decide between Python and R, choose Python. Do not learn another language in preparation for learning Python. The pandas library of Python is very similar to the libraries of R, so quite a lot of R experience will transfer to Python easily. But the data science profession is overwhelmingly moving to Python over R for applications beyond statistics.

Download and install the Anaconda distribution of Python.
https://www.continuum.io/downloads

It is free. It is available for Windows, Mac, Unix/Linux. It is the widely accepted standard Python. Most texts recommend Anaconda.

There are two major versions -- Python 2 and Python 3. I am still using version 2. Version 3 has been available for several years. Machine learning depends on libraries that extend the capabilities of the base language. Python 2 and Python 3 have some incompatibilities. Many of those libraries are available for both versions, but not all. Progress is being made in converting everything to Version 3, but many practitioners continue with Version 2. The changes to the base language are minor and will not seriously confuse people programming straight Python. Learn either.
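For a quick sense of how minor the base-language changes are, here are the two differences most people meet first (shown as Python 3 code, with the Python 2 behavior noted in comments):

# print is a statement in Python 2 and a function in Python 3
print("hello")    # Python 2 also accepts:  print "hello"

# the / operator floors integers in Python 2 but does true division in Python 3
print(7 / 2)      # 3.5 in Python 3; 3 in Python 2
print(7 // 2)     # 3 in both versions -- // is always floor division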

Anaconda Python comes with several development platforms. Two that you will want to consider are Spyder and Jupyter.
Spyder includes an editor and execution module all-in-one.
Jupyter is an outgrowth of iPython Notebook. It includes editing, execution, and documentation all-in-one.
You can sort of move back and forth between them, but I recommend picking one and using it exclusively.
To be clear -- installing Anaconda Python will automatically install both Spyder and Jupyter. Your choice is which to use day-by-day.
Jupyter's website:
http://jupyter.org/

-------------------

For home study of Python, there are numerous texts, pocket guides, free online courses, and paid online courses.

I like the work of Dr. Allen Downey. He has written several books, including "Think Python" which can be legally downloaded for free:
http://greenteapress.com/thinkpython/thinkpython.pdf
Or buy a printed copy from Amazon.

Many people like the approach where the student types out a lot of exercises -- no downloading, no cut and paste. "Learn Python the Hard Way" is one of the better ones. Here is a link to a version that can be read online for free:
https://learnpythonthehardway.org/book/
Or buy a printed copy from Amazon.

Coursera has offered several Python courses, ranging from absolute beginner to relatively advanced. Check to see what is available for the time period you plan to study. Some of the previous courses have been archived and resources, including videos of lectures, can be downloaded. Coursera is in the process of changing from free to paid. For most courses, but not all, you can still enroll and get access to the materials for free. I have watched the videos from several of these. None that I have seen are, in my opinion, excellent. Several are poor. Your method of learning will influence how effective each course is for you.
https://www.coursera.org/courses?languages=en&query=python

---------------------

For home study of machine learning, there is much to learn and there are many sources.

Among the many points to keep in mind, one is very important. Building machine learning models to identify profitable trades requires everything that learning to differentiate between species of iris or determining whether a borrower is likely to repay a loan requires. It also requires that the time sequence organization of the data and the monotonic increase in efficiency of the markets as time progresses be recognized and properly dealt with. I know of no book or online material that adequately addresses these special requirements. Indeed, several seem to intentionally disregard them. Begin by watching my video on "The Importance of Being Stationary."
http://www.blueowlpress.com/video-presentations
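As a minimal, concrete illustration of respecting the time sequence (a sketch, not a complete procedure): when validating, use a splitter that keeps training data strictly earlier than test data, rather than a shuffled split that leaks future information.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)      # stand-in predictor matrix, one row per day
y = np.random.randint(0, 2, size=100)  # stand-in daily target

# Each fold trains only on data strictly earlier than its test block
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("train through day", train_idx[-1], "-- test days", test_idx[0], "to", test_idx[-1])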

For a basic university-level introduction to machine learning, Dr. Andrew Ng's Stanford Open Classroom course is very good:
http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

I also like Dr. Yaser Abu-Mostafa's Caltech online course:
https://work.caltech.edu/

To incorporate machine learning into Python, a key library is pandas. Pandas is a Python library for data handling, with particular features for time series. The pandas library was developed by Wes McKinney while he was an analyst at Cliff Asness' AQR Capital Management hedge fund. Wes has left AQR but continues to be active in the applications of machine learning. There are several videos of his presentations on YouTube. His book, "Python for Data Analysis," was the first of several that describe the use of pandas:
https://www.amazon.com/Python-Data-...8064&sr=8-2&keywords=python+for+data+analysis

Dr. Jake VanderPlas is an astronomer at the University of Washington who is very active in the use of Python, pandas, and machine learning. His book, "Python Data Science Handbook," is outstanding:
https://www.amazon.com/Python-Data-...&keywords=python+for+data+analysis+vanderplas
Also watch his presentations, many of which are posted to YouTube.

For some of the details of machine learning techniques, I like Sebastian Raschka's books:
"Python Machine Learning"
https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/ref=sr_1_2?ie=UTF8&qid=1488218373&sr=8-2&keywords=Raschka,+Sebastian
"Python, Deeper Insights into Machine Learning"
https://www.amazon.com/dp/B01LD8K994/ref=rdr_kindle_ext_tmb

--------------

There are many more resources available. But this much is probably already an overload. I hope this helps getting started.

Best, Howard
 
+1 for "Learn Python the Hard Way" if you are starting from scratch.

It also requires that the time sequence organization of the data and the monotonic increase in efficiency of the markets as time progresses be recognized and properly dealt with. I know of no book or online material that adequately addresses these special requirements. Indeed, several seem to intentionally disregard them.
This is the big problem with all the material on the net about ML and trading. It just ignores it. You soon see that people who are putting stuff out there might as well be doing weekend courses on Gann. It's just rubbish. The only material I have found is this dude,

https://www.google.com.au/webhp?sou...me+series+http://machinelearningmastery.com&*

But it's advanced stuff. It is taking me some time to get my head around it (but I'm a dumbarse, so maybe some of the quants in the making will smash it). I would be interested in what you think of it, Howard?

A good site to play around with once you have some basic Python skills is
https://www.quantopian.com/
Has the data on the site (US EOD and 1 min) and lots of examples. As it's a back-testing/forward-testing site based on Python, it takes away a lot of the problems of getting the data into Python.

A site I keep on going to that is handy for little code snippets is,
http://chrisalbon.com/
 
Greetings --

MachineLearningMastery is the website and material of Jason Brownlee. I have most of his material, and it is quite good. (I believe he lives in Melbourne, Australia.) His most recent, and his only material related to time series as far as I know, is "Introduction to Time Series Forecasting with Python" which focuses on ARIMA-like models which are not very useful to traders.

I have not found much of value in Quantopian. But I am willing to be convinced otherwise. A lot of their material seems to be longer term and / or fundamental -- both of which have little to no value for trading.

I am skeptical of sites that do the hosting. I believe every trader should run his or her own code on his or her own computer. Learn from others, but when you have discovered something that works for you, be cautious about who else sees it.

Thanks for the introduction to Chris Albon's site. I will do some exploring.

----------------

A note to people who are not familiar with my work.

I have posted several videos on YouTube which will give some background into my thoughts, research, and experience. Start here:
http://www.blueowlpress.com/video-presentations

Pay particular attention to the material on risk. Begin with an assessment of personal risk tolerance, assess the risk of all financial systems being considered, normalize position size for risk, then estimate the future rate of return using the metric CAR25. Trade the system that has the highest CAR25. Position size must be kept out of the signal generation model. It is only useful when it is in the trading management model. Manage the trading day-by-day, adjusting position size using the dynamic position sizing technique as necessary to hold risk within tolerance. When the risk of the system being traded increases, which will be indicated by a drop in the CAR25 metric, take it offline and replace it with the then-best system.

The sweet spot is: trade often, trade accurately, hold a short period, avoid serious losses. Very few investment / trading systems give a return that is high enough to beat risk-free use of funds if they have a holding period longer than about 5 days and/or accuracy lower than about 65%. Any serious losses or sequences of small losses will raise the probability of a serious drawdown, causing position size to drop, making the system untradable.

Best regards, Howard
 
Gents,
Very busy lately but I will try to follow your leads.
Just want to thank you for your inputs.
 
I have not found much of value in Quantopian. But I am willing to be convinced otherwise. A lot of their material seems to be longer term and / or fundamental -- both of which have little to no value for trading.
Not sure about that. They have data that is 1-min bars and daily, so you can pick your time frame. But my point is that it's useful to get a start. One big problem with Python is data. There is no simple off-the-shelf solution to run your algos on. You are going to have to write something to do portfolio testing on. That in itself will lead to having to know SQL or some other database. When starting, it can be a step too far. Something like Quantopian has it all sorted so you just program in Python.... and the data is free!

I am skeptical of sites that do the hosting. I believe every trader should run his or her own code on his or her own computer. Learn from others, but when you have discovered something that works for you, be cautious about who else sees it.

Yeah, I probably agree. Once you get to do any sort of decent work I too would go private. But let's face it, as a first step we are not going to be doing work that will be putting Jim Simons out of business, so for the ease of use and free data I would have a look.
 
One big problem with Python is data

I really have very limited experience in Python, however I would have thought that data handling would be one of its strengths.

SQL? .... for this type of analysis, don't overlook flat files. Flat files are easier, better, and faster by orders of magnitude. Stay away from SQL unless you are really convinced you need it; the overhead is too expensive. The old Computrac -> Metastock filesystem that most programs can read is still one of the fastest, even with converting all the floats from MSBIN to IEEE standards. Premium Data still push their data in that format and AmiBroker still reads it. I would suggest learning to read your data from there; I am sure Howard sent that link in another thread that has all the nuts and bolts required.
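For the curious, here is a minimal sketch of that MSBIN-to-IEEE conversion for one 4-byte Microsoft Binary Format single (my own illustration of the standard rebias trick, not the code from Howard's link):

import struct

def mbf32_to_ieee(b):
    # b: 4 bytes as stored on disk -- mantissa low, mantissa mid,
    # sign + mantissa high, exponent (bias 128, implicit leading 0.1)
    m0, m1, m2, exp = b[0], b[1], b[2], b[3]
    if exp == 0:
        return 0.0                # MBF zero has a zero exponent byte
    sign = m2 & 0x80
    ieee_exp = exp - 2            # rebias: 0.1m * 2^(e-128) == 1.m * 2^((e-2)-127)
    bits = (sign << 24) | (ieee_exp << 23) | ((m2 & 0x7F) << 16) | (m1 << 8) | m0
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(mbf32_to_ieee(bytes([0x00, 0x00, 0x20, 0x83])))   # prints 5.0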

On a side note, avoid using CSV files as working files because you are converting strings to numbers, which is expensive too. Use them exclusively for import and export to and from other programs, but load and save your working data in a flat file or series of flat files.
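To make that concrete, a minimal sketch with pandas (file names are illustrative):

import pandas as pd

# Import from another program once, paying the string-to-number cost one time
df = pd.read_csv("quotes.csv", parse_dates=["Date"], index_col="Date")

# Save and load the working copy as a binary flat file -- no parsing on reload
df.to_pickle("quotes.pkl")
df = pd.read_pickle("quotes.pkl")

# CSV only on the way out, for other programs
df.to_csv("quotes_export.csv")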

Data handling and your data structure is the first thing you need to get right, do it wrong and your program will be slow and clunky and will suffer forever until you do it right.

THEN you can start looking at machine learning algorithms.
 
I really have very limited experience in Python, however I would have thought that data handling would be one of its strengths.

THEN you can start looking at machine learning algorithms.
Yeah mate, that's kinda my point. It's all very well to say this is the best. But for someone learning, you want to avoid re-inventing the wheel just to get to the start of doing the real work. Python is good.... once you get in there. As it stands there is no simple solution to getting your data files, like Premium Data's, into Python and running portfolio tests. Very easy to do single files, but that is useless for testing across a whole market. Nothing that I know of.

Anyone?
 
I heard that you can link Amibroker to Matlab and Matlab has lots of options for machine learning.

Has anyone linked Matlab to Amibroker?
 
Years ago I was teaching computer science as Microsoft began publishing Windows, Word, and Excel. The conversations were similar then -- Windows versus MSDOS versus CP/M -- Excel versus VisiCalc -- Word versus WordPerfect. Now, for us, Python versus R.

The two primary reasons to pick one over the other are capabilities and support.

To test capabilities. Install both and perform the functions that will be used over the life of the project.

To test support. The bookstore test helped then -- an equivalent internet test might be helpful now. I sent students to local bookstores -- Borders, Barnes & Noble, etc -- with the assignment of noting the number and quality of books for each. Also to read the employment ads to see which skills were most in demand.

For several years, I have heard the discussion about advantages and disadvantages of R and Python. In terms of capabilities, and in keeping with my own advice, I tried to develop machine learning trading systems in both. I worked through an entire trading system application -- from data acquisition through data munging, data transformations, train-test splits, crossvalidations, model selection, hyperparameter selection, model fit, model storage, model retrieval, prediction on new data, reporting and emailing results, running dynamic position sizing. I found that Python was easier to work with, provided a consistent set of tools, and allowed me to focus on the trading system aspects of the project.

In terms of support, first consider knowledge you already have and support you will receive from your friends and employer. If that is heavily oriented toward R, then choose R. Otherwise, do the bookstore and internet test. Look for the reference material you will need -- tutorials, books, websites. In my opinion, Python has a better support base.

You choose. Pick one. Become an expert in programming in it, in knowledge and use of the libraries you will need.

Then focus on machine learning for trading.

In very broad terms, the steps and components of developing a machine learning trading system are as follows (a minimal code sketch follows the list):
Data acquisition -- free or subscription services.
Data munging -- alignment, identifying errors, correcting or dropping erroneous data.
Data transformation -- create indicators/predictors, lagged values, prediction target.
Data selection -- extract in-sample data, reserving out-of-sample data.
Model selection -- decision tree, support vector, ensemble, etc.
Model fit -- learning.
Model test -- validation.
Model evaluation -- simple metrics, computing safe-f, CAR25.
Model storage -- save to disk for future use without refitting.
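Here is that minimal end-to-end sketch, using pandas and scikit-learn. The data file name, the simple momentum predictors, and the logistic regression model are illustrative assumptions only, not a recommended system:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.externals import joblib   # ships with scikit-learn; the standalone joblib package works the same way

# Data acquisition and munging (illustrative file; drop bad rows)
df = pd.read_csv("spy.csv", parse_dates=["Date"], index_col="Date").dropna()

# Data transformation: a 2-day momentum predictor plus one lagged copy
df["mom2"] = df["Close"].pct_change(2)
df["mom2_lag1"] = df["mom2"].shift(1)

# Target: is the next close higher than the current close? (1 = beLong)
df["target"] = (df["Close"].shift(-1) > df["Close"]).astype(int)
df = df.dropna()

# Data selection: fit on the early in-sample portion, reserve the rest out-of-sample
split = int(len(df) * 0.8)
predictors = ["mom2", "mom2_lag1"]
X, y = df[predictors].values, df["target"].values

# Model selection and fit
model = LogisticRegression()
model.fit(X[:split], y[:split])

# Model test and a simple evaluation metric
print("out-of-sample accuracy:", accuracy_score(y[split:], model.predict(X[split:])))

# Model storage: save to disk for later use without refitting
joblib.dump(model, "model.pkl")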

The steps and components of using a machine learning trading system are (again, a short sketch follows the list):
Data acquisition -- gathering current values of the data series.
Data munging.
Data transformation -- applying the same transformations that were used in development.
Model retrieval from disk.
Prediction using the stored model and the new data.
Determination of system health and position size based on recent performance.
Trading based on the prediction.
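And the matching sketch of using the stored model day-by-day (same illustrative file and column names as the development sketch above):

import pandas as pd
from sklearn.externals import joblib

# Acquire and transform current data with exactly the transformations used in development
df = pd.read_csv("spy_current.csv", parse_dates=["Date"], index_col="Date").dropna()
df["mom2"] = df["Close"].pct_change(2)
df["mom2_lag1"] = df["mom2"].shift(1)
df = df.dropna()

# Retrieve the fitted model and predict the state for the most recent bar
model = joblib.load("model.pkl")
signal = model.predict(df[["mom2", "mom2_lag1"]].values[-1:])
print("beLong" if signal[0] == 1 else "beFlat")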

------------------

As I have written, the sweet spot is to use daily data, trade a single issue, long/flat or short/flat. Trade frequently, trade accurately, hold a short period, and avoid serious losses. That is fortunate, because those characteristics are the easiest to model. Pick one issue, choose the target to be whether the next close is higher or lower than the current close, mark-to-market daily, manage daily.

The concept of a trading system changes considerably. There are no preset rules. The model decides what is important from analysis of the training data. It is data mining, searching for signals among the noise.

The result will be a series of state signals, each valid for one day. That is -- a sequence of "beLong" or "beFlat" states for a single tradable issue, each one day long. Holding periods longer than one day will be indicated by several "beLong" states in succession. There are no maximum loss stops, no trailing stops, no imposed holding periods, no predefined critical parameter values, no portfolios, no position sizing. If you are not comfortable with this, machine learning-based trading systems will not work for you.
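A minimal sketch of building such a state series from closing prices (illustrative numbers; note the final bar's state is unknowable until the next close prints):

import numpy as np
import pandas as pd

close = pd.Series([100.0, 101.5, 101.0, 102.2, 103.0], name="Close")

# One state per bar: beLong if the next close is higher, else beFlat
state = np.where(close.shift(-1) > close, "beLong", "beFlat")
print(state)   # consecutive beLong states make up a multi-day hold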

------------

In response to some of the comments made.

I think Python with Pandas is an easy solution to handling input data. There are several free sources of daily data -- including Quandl, Yahoo, Google. All of these come with the caveat that using free data moves some of the data quality issues to the end user. Quandl also has subscription data. I have not evaluated Quantopian's data.

If a single source does not provide every data series you need, it is easy to gather data from multiple sources, store them in one or more Pandas dataframes, and use Pandas utilities to align, adjust, combine, etc.
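A small sketch of that gather-and-align step (file and column names assumed; pandas aligns rows on the shared DatetimeIndex automatically):

import pandas as pd

spy = pd.read_csv("spy.csv", parse_dates=["Date"], index_col="Date")["Close"]
vix = pd.read_csv("vix.csv", parse_dates=["Date"], index_col="Date")["Close"]

combined = pd.concat([spy, vix], axis=1, keys=["SPY", "VIX"])  # date-aligned columns
combined = combined.dropna()    # keep only dates present in both series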

Long-lookback indicators and long-term filter indicators are not helpful. The modeling process is searching for signals. A signal can only occur when there are changes within the data. In order to generate, say, 50 signals a year, there must be 50 changes in predictor values each year. The data analysis routines (often hidden) within each model will evaluate and discard data series that are overly redundant -- that either duplicate other series or present long sequences of constant values.

In the end, you will wind up with one or two indicators that oscillate at about the same frequency as the signals you are looking for.

The signals generated by the model will correspond to the bars used in the modeling. Each bar results in one row in the data matrix presented to the modeling routine. There will be a signal for each bar. If you plan to trade one-minute bars, then model one-minute bars. Modeling one-minute bars, or even hourly bars, will not help if your plan is to trade daily.

Similarly, using daily bars will not help if your plan is to trade once a week. But if you are managing trades less frequently than daily, the risk of the position is certain to exceed your risk tolerance, and the CAR25 value will suggest that the system not be traded.

Similarly for portfolios. CAR25 is a Dominant metric. Given two or more trading systems, each with signals to enter and exit a single issue, pick the one with the highest CAR25. Splitting trading funds to take two positions and form a portfolio might provide a feeling of comfort due to diversification. What it really does is ensure that some of the funds are being used sub-optimally.

Database issues do not arise. Once the data has been read into the dataframe, there is no use of external data storage -- no need for csv files, no need for SQL databases. After fitting, the trained model will be stored on disk. The main component of the model is a matrix of coefficients that are the solution to the AX = Y set of simultaneous equations -- the "A" matrix. Python stores that for us, and we have no control over the format of the storage. When we later want to use the model, Python retrieves the model. All we need to do for both operations is provide the file name and path we want to use.
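As a tiny illustration of that point (numbers made up): after fitting, a scikit-learn linear model is little more than its coefficient matrix, which it exposes directly.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])   # generated exactly from y = 1*x1 + 2*x2

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # the "A" coefficients [1. 2.] and offset 0.0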

Best regards, Howard
 
I reread the thread. There seems to be a question of efficiency of data access. This is the complete sequence necessary to load 20 years of daily data for SPY into a pandas dataframe named qt:

import Quandl
qt = Quandl.get("GOOG/NYSE_SPY")

best, Howard
 
I reread the thread. There seems to be a question of efficiency of data access. This is the complete sequence necessary to load 20 years of daily data for SPY into a pandas dataframe named qt:

import Quandl
qt = Quandl.get("GOOG/NYSE_SPY")

best, Howard
:banghead::banghead:

That's not the issue!!!! It's one instrument! If I wanted to do a test against the S&P 500 constituents, your code blows out to 1000 lines just for the data query of poor-quality free data.

Show me some code to feed in and filter data that is actually useful. I have Norgate's ASX data in
C:\Trading Data\Stocks\ASX\Equities, which is in 24 folders and has countless .dat files.
I also have 37.5 GB of futures data spread over 849,803 files in 1000s of folders as flat files. I'd love for you to show me how to do a portfolio test against that with two lines of code. I can read each individual file, but that doesn't mean it's an easy task to consolidate a PORTFOLIO to test against.

I have stated this a few times: one of the big problems with Python or any of the 'new' programming languages is the woeful practical application outside of overly simplistic examples like Howard has just given. The 'old' backtesting software, where programmers have done all the nutz'n'boltz work for you, saves you hundreds of hours and saves you from having to re-invent the wheel just to get started. Which all brings me back to Quantopian.... they have done the nutz'n'boltz.
 
Hi TH --

From your comments, you have not read and worked through the examples of risk-normalized profit potential.

Here is a brief summary:

Begin by analyzing the price data itself before applying a model. There is a procedure called the Data Prospector, fully disclosed, that will analyze the risk and profit potential of any data series, even before attempting to develop a model to trade it. Some issues will have too little volatility to provide profit; some too much to be tradable. There is a middle group of goldilocks issues that Might work -- we do not know yet. Continue with that group and check liquidity. I recommend that there be enough liquidity so that the trader can exit his or her entire position in any minute of any day without substantially affecting the bid-ask spread. I also look for bid-ask spreads that are one cent at almost any time.

There will be a list of a couple dozen issues that pass those filters. Now begin system development. Try to model each one long/flat. Or short/flat (but this is much harder). Each data series and its associated model create a trading system -- a system that trades a single issue long/flat. Validate each of those individually to ensure that the system is likely to be profitable in the future.

Define your personal risk tolerance. For example, wanting to hold the risk of a drawdown in excess of 20% to a chance of less than 5%. Each system you trade will have its position size adjusted trade-by-trade to keep it within your risk tolerance.

Apply the risk analysis to each of the several systems that look promising. Each has a maximum safe position size, safe-f. This is the portion of funds that can be used to take positions. The remainder of the trading account must stay in a risk-free account to act as ballast to compensate when the funds traded enter a drawdown. When traded at safe-f, each system has a profit potential. It can be quantified. It is called CAR25.
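To give the flavour of those numbers, here is a rough Monte Carlo sketch in the spirit of safe-f and CAR25 (a simplified illustration of the ideas above, not the full procedure; the trade distribution, the 20%/5% tolerance, and the step size are made up):

import numpy as np

rng = np.random.RandomState(42)
trades = rng.normal(0.002, 0.02, 250)    # made-up one-day trade returns

def risk_and_car(fraction, n_reps=1000, horizon=252):
    max_dds, cars = [], []
    for _ in range(n_reps):
        daily = 1.0 + fraction * rng.choice(trades, horizon)  # resample with replacement
        equity = np.cumprod(daily)
        drawdown = 1.0 - equity / np.maximum.accumulate(equity)
        max_dds.append(drawdown.max())
        cars.append(equity[-1] - 1.0)    # one-year horizon, so final equity gives CAR
    return np.percentile(max_dds, 95), np.percentile(cars, 25)

# safe-f: the largest position fraction whose 95th percentile max drawdown
# stays within the 20% tolerance (5% chance of exceeding it)
fraction = 1.0
while fraction > 0.0 and risk_and_car(fraction)[0] > 0.20:
    fraction -= 0.05
dd95, car25 = risk_and_car(fraction)
print("safe-f:", round(fraction, 2), " CAR25:", round(car25, 3))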

CAR25 is a Dominant metric. The best use of funds is to trade the single system with the highest CAR25.

Expect distribution drift. Monitor each of the systems day-by-day, computing CAR25 for each. As trading performance changes, trade the one that has the highest CAR25. If the CAR25 of the one at the top of the list is not higher than the return of risk-free funds, do not trade any of the systems.

Portfolio construction is not necessary, or even desirable. Assume there are two systems, one with CAR25 of 12%, the other with CAR25 of 22%. Assume safe-f is 100% for both, so all funds are available to buy shares. Forming a two-issue portfolio creates a profit stream that is 17% -- half from the 12% system, and half from the 22% system. The trader is better off using all funds for the 22% system. Watching carefully to switch when some other system shows better performance as indicated by CAR25.

Best regards, Howard
 
Greetings --
Regarding: That's not the issue!!!! It's one instrument! If I wanted to do a test against the S&P 500 constituents, your code blows out to 1000 lines just for the data query of poor-quality free data.

-----------------

You will have read in my post of a few minutes ago my recommendation for issue selection and portfolio construction -- that each system be a single issue traded long/flat or short/flat. If a person wanted to analyze many issues, all in the same run, with all data in core at the same time, here is how.

-----------------

First -- choose poor quality free data or curated premium data. There is no change other than the string identifying the ticker of the issue to be loaded.

The example I posted used free data which can be sourced, at your preference, from Google or Yahoo. The call is exactly the same to use curated premium data. It changes from:
qt = Quandl.get("GOOG/NYSE_SPY") # Free data
to:
qt = Quandl.get("EOD/SPY") # Curated premium data

----

From the Quandl site:
"Quandl hosts several commercial-grade "premium" databases in addition to our free data. These premium databases are of a higher quality, accuracy, timeliness and documentation standard than our free data and are intended for professional use.
Premium databases are not free; you have to subscribe to access them. However, we do offer generous free trials so that you can try before you buy."

-----------------

Then -- to load many issues into core.

To load the prices for several issues, or perhaps several thousand issues (there is no limit), the code changes very little. There will be a call to the data server for each issue, but the block of code is three lines longer (a little shy of 1000 lines) in order to store any number of additional data series into additional columns of the dataframe.

To fill a dataframe with prices from a few issues (slightly pseudo-coded):
tickerlist=["IBM","SPY",...,"AAPL"] # Or read the tickerlist from a diskfile, perhaps stored as a watchlist
for issue in tickerlist:
colName = issue
qt[colName] = Quandl.get(issue)

--------------------

I understand that many people will be resistant to the ideas being suggested. Questions are welcome. Civil questions preferred. If, after working through the math involved, this approach is not for you, please revert to lurking rather than disruption.

There is math -- as already seen in the data prospector, safe-f, and CAR25 -- and there will be much more math as we get into machine learning.

----------------

Best regards, Howard
 
:banghead::banghead:
I have stated this a few times: one of the big problems with Python or any of the 'new' programming languages is the woeful practical application outside of overly simplistic examples like Howard has just given. The 'old' backtesting software, where programmers have done all the nutz'n'boltz work for you, saves you hundreds of hours and saves you from having to re-invent the wheel just to get started. Which all brings me back to Quantopian.... they have done the nutz'n'boltz.

Greetings --

This isn't about Python. It is about machine learning. Specifically for trading. Python is the base language which provides access to a set of libraries that implement the machine learning fitting and testing routines we will discuss and develop.

If something being used is already providing satisfactory results, look no further. Continue to use it. Please do not disrupt this thread.

The vast majority of traditional trading system development platforms implement "old backtesting software" that is based on simple decision trees. That is the simplistic part.

By the time we get to the end of this thread, there will be few readers who think machine learning, and the scientific, data driven, distribution oriented, Bayesian, and crossvalidation-based techniques it uses, are simplistic. Machine learning broadens available models to include dozens of techniques that are often significantly better -- higher risk-normalized profit -- than individual decision trees. There will be some math.

Best regards, Howard
 
Greetings --

This isn't about Python. It is about machine learning. Specifically for trading. Python is the base language which provides access to a set of libraries that implement the machine learning fitting and testing routines we will discuss and develop.

If something being used is already providing satisfactory results, look no further. Continue to use it. Please do not disrupt this thread.

The vast majority of traditional trading system development platforms implement "old backtesting software" that is based on simple decision trees. That is the simplistic part.

By the time we get to the end of this thread, there will be few readers who think machine learning, and the scientific, data driven, distribution oriented, Bayesian, and crossvalidation-based techniques it uses, are simplistic. Machine learning broadens available models to include dozens of techniques that are often significantly better -- higher risk-normalized profit -- than individual decision trees. There will be some math.

Best regards, Howard

Greetings Howard,
Most of us don't have time to write a stock-selecting, stock-trading program. But I think it's a great idea if you can get it working. It would be great to leave all the hard analysing to an AI that would automatically trade and make profits. If you could perfect it, it would be worth millions. I would be interested; keep up the good work and give us updates when you can.
 
Trillionaire --

1,000,000 bank accounts, each with a balance of $1,000,000. Or the equivalent.

Earning 1% per year, the return would be $10,000,000,000 per year -- over $25 million per day.

In accumulating funds, there is risk. Whenever there is risk, there are two absorbing boundaries -- winning and bankruptcy. When there is a non-zero chance of bankruptcy, no matter how small, a prudent person would decide what level constituted "enough" and remove funds from further risk.

Assuming living costs are not significantly different than they are today -- why?? Money decreases in utility as the accumulation of it increases. After buying a reasonable, or even an excessive, house, what does the 10th million enable that the 9th did not? Or the 110th and 109th. Or fill in your own numbers. But at some point one additional million dollars adds an undetectable amount of value. Well before reaching one trillion, in my opinion and for my lifestyle.


Best, Howard
 
Assuming living costs are not significantly different than they are today -- why?? Money decreases in utility as the accumulation of it increases. After buying a reasonable, or even an excessive, house, what does the 10th million enable that the 9th did not? Or the 110th and 109th. Or fill in your own numbers. But at some point one additional million dollars adds an undetectable amount of value. Well before reaching one trillion, in my opinion and for my lifestyle.

Yes... I'd rather be this guy.

 
Data Structures used in Machine Learning for Trading

There are many development platforms that support traditional trading systems, including TradeStation, AmiBroker, NinjaTrader, WealthLab, and dozens of others. There is not yet a trading system development platform that gives developers who want to use impulse signals and multi-day holds access to a wide variety of model-fitting techniques. The reason to look beyond the traditional platforms is that all of them are limited in model choice to decision trees, while other models may produce trading systems with better performance.

Fortunately, trading systems that use state signals and mark-to-market every bar are easily implemented. They fit directly into the machine learning techniques supported by Python and scikit-learn.

What follows is a simplified list of the steps of trading system development and trading management, annotated with short descriptions of the data structure and program associated with each step.

Development

... Acquire historical data. A temporary Pandas dataframe (similar to a two-dimensional array or spreadsheet -- references at the end of this post) will be used to receive the data from the vendor. Each day or bar of data creates an observation. Most likely your program will open and read data files from a data provider such as Yahoo, Google, or Quandl.

... Data examination and cleaning. The temporary Pandas dataframe. Programs that examine the data series looking for missing data, inconsistencies, outliers. Programs that plot the data giving an opportunity for visual inspection.

... Data consolidation. The main Pandas dataframe. Data from the individual streams and sources are combined into a single dataframe. Pandas performs date alignment and time zone adjustment automatically.

... Indicator computation. The main Pandas dataframe. Indicators (such as RSI or a detrended price oscillator) that will be used as predictors are computed using functions that operate on columns of the dataframe, creating additional columns. Previous values of indicators (lagged values) are copied into individual columns, creating new indicators.

... Target computation. The main Pandas dataframe. The machine learning algorithm will make the best fit it can to the target variable, based on the predictor variables. There will be a target for every observation. The target has its own column in the dataframe.

... Data preparation for machine learning. Conversion from Pandas dataframe to numpy array (more basic two dimensional array). A single assignment statement performs the conversion to the data format that the scikit-learn models are programmed to expect.

... Hyperparameter determination (Model selection; period of stationarity; performance metrics; predictor variable selection; train/validate/test split). Hyperparameters are variables set before fitting the model to predict the target. In Python, these are set using the crossvalidation libraries of scikit-learn, typically together with grid searches (a short sketch follows this list).

... Model fitting. Two numpy arrays are passed to the model fitting procedure -- a two dimensional array with columns of predictor variables and rows of observations; and a one dimensional array with an entry for each observation, holding the target value we want the model to learn. The scikit-learn model has been designed to implement a particular fitting technique, programmed, verified for correctness, and optimized for execution speed. The fitting process produces a storable model that can be used later to predict the target value for a given set of predictor variables.

... Prediction. The previously fitted and stored model, together with a two dimensional numpy array of predictor variables. The prediction process applies the model to the data and produces a one dimensional array with predicted target values -- one per observation.

... Model assessment. Two one dimensional arrays -- one of known target values, the second of predicted target values. Built-in assessment routines compare the two arrays and produce goodness-of-fit metrics. Alternatively, a custom program (perhaps written by you) evaluates the risk profile and profit potential of the predictions.

... Model storage. The model is essentially the coefficients to a set of simultaneous equations, together with the definition of the model. It is stored on disk for later retrieval and use.
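Here is that short sketch of the hyperparameter step (stand-in data; the grid and the decision tree are arbitrary illustrative choices):

import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

X = np.random.randn(300, 4)                   # stand-in predictor matrix
y = (np.random.randn(300) > 0).astype(int)    # stand-in target

# Search a small grid, scoring each candidate with time-ordered folds
search = GridSearchCV(DecisionTreeClassifier(),
                      param_grid={"max_depth": [2, 3, 5, 8]},
                      cv=TimeSeriesSplit(n_splits=5))
search.fit(X, y)
print(search.best_params_)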


Trading and trading management

... Acquire current data. Extend the Pandas dataframes used to acquire historical data. Examine the data, consolidate it, compute indicators, and prepare the predictor variable array expected by the model. The target is unknown, and no target array is prepared.

... Model retrieval. Retrieve the previously fitted model from disk.

... Prediction. The model processes the updated array of predictor variables and produces predictions for the new observations.

... Model assessment. Using the trading management model -- which has been prepared using a similar procedure to the trading model -- determine system health, estimate risk-normalized profit potential, determine maximum safe position size.

... Trade placement. Place an order.

-----------------------------

The new data structure is the Pandas dataframe. Pandas was developed by Wes McKinney while he was an analyst at Cliff Asness' AQR Capital Management hedge fund. Wes is no longer at AQR, but continues to develop machine learning tools and techniques. His description of Pandas can be found in his book, "Python for Data Analysis." A second edition is due to be published in August 2017.

Jake VanderPlas, an astronomer active in the machine learning community, has published "Python Data Science Handbook" which explains Pandas dataframes and numpy arrays with excellent examples. I highly recommend Jake's book.

Sebastian Raschka, a PhD candidate in computer science at Michigan State University, has published several excellent books related to machine learning. Begin with "Python Machine Learning."

Best, Howard
 