Australian (ASX) Stock Market Forum

Securities Price Tank - Storing CSV file Security Prices in Memory

Reading some of the posts on this forum, it seems people are having trouble working out how to read security prices from files in languages like Python and then run some kind of technical analysis function over them, such as a simple moving average. It also seems there is no easy, obvious way in Python to run simple technical analysis functions across a whole portfolio of securities. The various Python packages and examples look rather complex, so this sort of thing ends up in the too-hard basket. That's not good for anyone, and it's certainly not good for those who want to delve into data science.

I've decided to bridge the gap using Python's ctypes module and a shared library. I have written an efficient C/C++ shared library that reads CSV files (in a strict format) into memory, so that a Python script (or another language) can then iterate through the securities and fetch the corresponding prices (e.g. the closes) as an array. This lets the user concentrate on the technical analysis side of things, like writing simple moving average code in Python, without having to worry about how the input data is stored. From there it can lead to more advanced programming ideas and interesting projects, depending on your knowledge.

Since this problem has become such a big issue, I've decided to release the code so that people can compile it and create the shared library not just on Windows but on other OS platforms as well, such as Linux. Hopefully there are no errors, but code is only as good as the people who use and test it!

A picture says a thousand words so let's start with linux. Once you have extracted the attached zip file, just follow the walk-through in the following diagram:

01_01_SPT_LBuild.png

You will notice that after the shared object file is built, both Python 2.7 and 3.4 complain that they can't find it. This is normal and is easily rectified by editing the Python script "spt.py" and prefixing "libSPT.so" with the current working directory in the LoadLibrary entry. To find the current working directory on Linux, type "pwd" (shown at the bottom of the diagram). After that you should be able to run the Python script successfully. If anyone has a more professional solution for loading .so files, please feel free to comment. This will do for now.
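
For example, a minimal sketch of that fix (assuming spt.py loads the library with ctypes' LoadLibrary; the exact call in your copy of spt.py may differ slightly):

import os
import ctypes

# Build an absolute path so the loader finds libSPT.so regardless of how Python searches.
lib_path = os.path.join(os.getcwd(), "libSPT.so")
spt = ctypes.cdll.LoadLibrary(lib_path)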

And now on to the more complicated Windows build. After installing both Python and the mingw*-g++.exe compiler, find out where your g++ executable is located (to simplify things) and follow the walkthrough in the following diagram for your Windows OS (Win32 or Win64). For the Win64 build, you also need to find the file "libwinpthread-1.dll" somewhere in your MinGW installation and copy it to where your extracted zip files are stored.

01_02_SPT_WBuild.png

After that, you can just type "python spt.py" and it should work. There is no need to hard-code paths, as Windows Python is smart enough to use the current working directory when locating the shared library (.dll).

The attached zip file contains the following files:

quotes.csv - sample file containing security prices (symbol,date,open,high,low,close,volume)
spt.cpp - c++ shared library code
spt.h - c++ declarations containing functions that interface calling application with shared library code
spt.py - python script that uses shared lib to process security prices from csv files such as sample file "quotes.csv"

Note: spt = securities price tank.

Cheers,

Andrew
 

Attachments

  • spt_20170414.zip
    742.7 KB
Here is what happens when you run the script:

02_01_SPT_LTR1.PNG

If you press enter, it will just iterate through the tank displaying what was fed into it.

So starting at the top, we display the following:

Total records inserted, which is the number of successfully inserted rows from the CSV file(s). Any errors are displayed.

Total securities.

Max security records, which is the maximum number of records any one security has in the current pool of data (the tank). This is an important number because when we fetch the price arrays for each security in memory, we need to allocate this many elements beforehand; not doing so will cause a segmentation fault.

Database load time, which is how long it took the shared library to load the data from the CSV file into memory (the tank).

After that we run three timed loops, each iterating through all 2000 securities stored in memory (the tank).

First we see how long it takes to iterate through securities while doing nothing 2000 times.

Next we time how long it takes to fetch all 2000 security prices.

Finally we time how long it takes to compute a simple moving average 2000 times.

Hopefully viewing the code in the Python script will make things clearer. It can be made to do much more, but it will serve as a good template for now :).
 
Here is a detailed description of the shared library interface functions:

SPT_RemoveAll(): Removes all securities from memory (empty tank)

SPT_SetStrictInsertDateOrder(order):

Only allows data to be inserted in ascending or descending order.
order = 0 is descending meaning newest to oldest dates:

BHP,2017/01/10,.... Ok
BHP,2017/01/09,.... Ok
BHP,2017/01/08,.... Ok
BHP,2017/01/11,.... Error

order = 1 is ascending meaning oldest to newest dates:

BHP,2017/01/08,.... Ok
BHP,2017/01/09,.... Ok
BHP,2017/01/10,.... Ok
BHP,2017/01/07,.... Error

We need this so that there are no duplicates, and because the order of the data is of the utmost importance.

SPT_SetInsertDateConstraint(startyear,.....,endyear,.....):

Only insert between these dates inclusive.
Other dates are ignored.

SPT_Insert(...):

Insert a record, such as a row from a CSV file, into memory (the tank).
This is not used in the Python script, as inserting row by row from Python was found to be too slow for reading CSV files.
Basic validation, such as rejecting records where open > high or low <= 0, is also performed here; if a record fails, it is rejected and a message is displayed.

SPT_CSVFileInsert(dfname, dfnameissymbol=0/1, hastime=0/1):

Insert prices from CSV file into memory (tank)
dfnameissymbol = 0 means data filename is not the symbol. We expect the following format from csv file:

SYMBOL,date[,time],open,high,low,close,volume

dfnameissymbol = 1 means data filename is the symbol. We expect the following format from csv file:

date[,time],open,high,low,close,volume

hastime = 0 means no time column. We expect the following format from csv file:

[symbol],date,open,high,low,close,volume

hastime = 1 means time column. We expect the following format from csv file:

[symbol],date,TIME,open,high,low,close,volume

IMPORTANT NOTE: I have put some commented out code in the python script that can be used to read in multiple files.
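
As a hedged sketch only (the commented-out code in spt.py may look different), reading a directory of per-symbol files such as bhp.csv could be done along these lines, assuming spt is the library handle loaded via ctypes and that dfnameissymbol=1, hastime=0 apply:

import glob

for fname in glob.glob("*.csv"):
    # encode() so ctypes receives a char* under Python 3; plain strings are fine under 2.7
    spt.SPT_CSVFileInsert(fname.encode(), 1, 0)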

SPT_GetTotalSecurities(): Returns the total number of securities currently in the tank.

SPT_GetTotalRecordsInserted(): Returns the total number of records inserted (1 record = 1 successfully inserted row from a CSV file).

SPT_GetSecurityMaxRecords():

Returns the maximum records (1 rec = 1x[date,prices,volume]) any one security currently holds.
We need this to dynamically allocate memory for the arrays in the calling application, such as the Python script.

SPT_First(): Start our iterations at the first security. Returns number of records used.

SPT_Next(): Iterate to next security in the list. Returns number of records used.

SPT_GetSymbol(symbol): Get the symbol name for the current security.

SPT_GetAllPrices(date,prices...,volume): Get the prices for the current security.

SPT_GetAllPricesRev(date,prices...,volume):

Get the prices for the current security in the reverse order.
This is useful for TA purposes as order is important.
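
To tie the functions above together, here is a rough Python/ctypes sketch of a typical iteration loop. It is not the actual spt.py: the argument types, buffer sizes and array element types (especially for the date array) are assumptions, so check spt.h/spt.py for the real signatures:

import ctypes

spt = ctypes.cdll.LoadLibrary("./libSPT.so")        # SPT.dll on Windows

spt.SPT_CSVFileInsert(b"quotes.csv", 0, 0)          # symbol column present, no time column
maxrecs = spt.SPT_GetSecurityMaxRecords()

# Pre-allocate arrays big enough for the largest security (see SPT_GetSecurityMaxRecords above).
PriceArray = ctypes.c_double * maxrecs
dates, opens, highs, lows, closes, volumes = (PriceArray() for _ in range(6))
symbol = ctypes.create_string_buffer(32)

used = spt.SPT_First()                              # returns number of records used
while used > 0:
    spt.SPT_GetSymbol(symbol)
    spt.SPT_GetAllPrices(dates, opens, highs, lows, closes, volumes)
    # ... your own TA code over closes[0:used] goes here ...
    used = spt.SPT_Next()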

Date from csv file can be in the following formats:

dd/mm/yyyy
yyyy/mm/dd
yyyymmdd

where "/" could be any other character such as "-" or "."

Time from csv file can be in the following formats:

hr:min
hr:min:sec
hr:min:sec.msec
int

If the time is just an int with no separator, then that value is stored as-is (user defined).
Otherwise the time is converted to milliseconds into the day and stored as follows:
time = hour * 3600000 + min * 60000 + sec * 1000 + msec

To convert this value back you would have to code something like:

hour = int(time / 3600000)
minute = int((time % 3600000) / 60000)
sec = int((time % 60000) / 1000)
msec = time % 1000

A typical date/time format is something like:
2017/04/10,10:12, or 10/04/2017T10:12:04.123, or 20170410:0940,

That is all for now. Happy Easter ;).

Cheers,

Andrew.
 
hi,

great effort. however why don't you just use the pandas library to do all this?
 
Yeah, I was kinda wondering the same. If all the data is in CSV, it's only a few lines of code to open the files and put them into a dataframe.
 
For those that use Norgate Premium Data, there is a conversion tool in the downloads section that can convert the metastock data files to CSV format. You can choose how you want the data formatted.

It only took about two minutes to convert every ASX stock into CSV files, which I thought was pretty quick.
 
I think that only works for the old version of Premium Data, not the latest NDU product.
 
This is the first I've heard of a new product. It looks like the "old product" is still the current product unless you sign up for beta testing. Time will tell if they retain the ability to export CSV files.
 
The attached Windows Excel file shows how to access the "SPT.dll" functions from VBA in a way similar to the Python script.

The diagram below shows the results of clicking on the buttons "Test" and "Turnover Test".

03_01_SPT_XLS1.PNG

So basically the VBA script provides the input filename "quotes.csv" to SPT.dll (which you created in the first post). The data is read into the tank and then we iterate through each security to process its data.

The Test button macro has the same functionality as the Python script.

The Turnover Test button macro lists all securities, with dates, where the 5-day average turnover is greater than $50 million.
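
For comparison only, a hypothetical plain-Python version of that turnover filter (not the VBA code in spt.xls), assuming the dates/closes/volumes lists for one security have already been fetched from the tank:

def high_turnover_days(dates, closes, volumes, n=5, threshold=50e6):
    hits = []
    for i in range(n - 1, len(closes)):
        avg_close = sum(closes[i - n + 1:i + 1]) / n
        avg_volume = sum(volumes[i - n + 1:i + 1]) / n
        if avg_close * avg_volume > threshold:      # 5 day average turnover > $50 million
            hits.append(dates[i])
    return hits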

To view the VBA code in windows excel, just press Alt-F11 and find the spt module.

Hopefully this .xls file will run on all versions of Windows Excel >= 2003. To run it, simply make sure the file resides in the same location as the "SPT.dll" you created and the file "quotes.csv".

Cheers,

Andrew.
 

Attachments

  • spt.xls
    96 KB
For the sake of completeness, I thought I'd add how to handle multiple CSV files.

The attached file sptmcsv20170805.zip contains the following files:

amp.csv - single stock quotes file.
anz.csv - single stock quotes file.
bhp.csv - single stock quotes file.
bxb.csv - single stock quotes file.
cba.csv - single stock quotes file.
csl.csv - single stock quotes file.
iag.csv - single stock quotes file.
mqg.csv - single stock quotes file.
nab.csv - single stock quotes file.
qbe.csv - single stock quotes file.

sptmcsv.py - python script modified to handle multiple csv files.
sptmcsv.xls - excel file/script modified to handle multiple csv files.

Both scripts require the shared lib that you compiled earlier.

Cheers,

Andrew.
 

Attachments

  • sptmcsv20170805.zip
    41.2 KB
I was able to split the quotes.csv file (from the first post) into nearly 2000 per-symbol files, and these have been zipped into the attached file mfquotes.zip. The CSV file format is the same as the CSV files in the previous post ("sptmcsv20170805.zip").

I have done this so similar comparisons can be made against the single quotes.csv file. I also think this is the format Premium Data uses when you export the data to CSV: the symbol is the filename and the format is "date,open,high,low,close,volume". I hope that's right, because that's why I added this feature.

So when we run "python3 sptmcsv.py" in the same directory as the 2000 files, we get

sptmcsv_result1.png

I notice the database load time here is about 5 times slower than when the script was run with the single file "quotes.csv". This is probably because of the opening and closing of about 2000 files. Since there is only a maximum of two months' worth of data for each security, I expect the time difference would be a lot smaller for, say, five years' worth of data. Overall, not too bad, and all the other statistics are pretty much the same.

And now on to using just plain Python (without shared libraries). I now want to answer the question "Why don't I just use the pandas library to do all this?".

So what I plan to do with pandas is create a dictionary of dataframes, where the key is the symbol and the value is the dataframe containing the price data, and where the input files follow the same two formats discussed earlier. We then want to compare iterating through all 2000 symbols against my shared-lib solution. The main comparison is the simple moving average. I don't want to use built-in indicator functions, because the programmer must be able to create their own indicators.
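
As a rough sketch of that plan (filenames, column names and the header handling are assumptions; the actual sptdf.py posted later differs in the details):

import glob, os
import pandas as pd

tank = {}                                            # symbol -> dataframe
for path in glob.glob("*.csv"):
    symbol = os.path.splitext(os.path.basename(path))[0].upper()
    tank[symbol] = pd.read_csv(path)                 # per-symbol file: date,open,high,low,close,volume

for symbol, df in tank.items():
    closes = df["Close"]
    df["SMA5"] = closes.rolling(5).mean()            # placeholder for your own hand-rolled indicator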

Stay tuned.

Andrew
 

Attachments

  • mfquotes.zip
    1 MB
I realised that the database load time in my last post was so high because I included the print statements in the timing. Having commented out the print statements and a couple of other useless variables, I only got it down to 0.6 seconds, which was still roughly 3.5 times slower. I now blame doing the string manipulations (such as create_string_buffer and os.path.join) 2000 times for the performance difference.

I have now attached the file "sptdf20170820.zip", which contains the Python-only script "sptdf.py" and is to be extracted into the same directory as the CSV file(s). If you run the script and get an error like this:

Traceback (most recent call last):
File "sptdf.py", line 11, in <module>
import pandas as pd
ImportError: No module named pandas

it means you don't have the pandas package installed and need to install it for the script to work correctly.

On linux, I successfully executed the following:

$ sudo apt-get install python-pandas (for python3 pandas put python3-pandas)

On Windows, I did the following:

E:\>pip install --upgrade pip
Requirement already up-to-date: pip in c:\python34\lib\site-packages
E:\>pip install pandas
Collecting pandas
Downloading pandas-0.20.3-cp34-cp34m-win_amd64.whl (8.1MB)
100% |################################| 8.1MB 95kB/s
Collecting pytz>=2011k (from pandas)
Downloading pytz-2017.2-py2.py3-none-any.whl (484kB)
100% |################################| 491kB 398kB/s
Collecting numpy>=1.7.0 (from pandas)
Downloading numpy-1.13.1-cp34-none-win_amd64.whl (7.6MB)
100% |################################| 7.6MB 101kB/s
Collecting python-dateutil>=2 (from pandas)
Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
100% |################################| 194kB 443kB/s
Collecting six>=1.5 (from python-dateutil>=2->pandas)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: pytz, numpy, six, python-dateutil, pandas
Successfully installed numpy-1.13.1 pandas-0.20.3 python-dateutil-2.6.1 pytz-2017.2 six-1.10.0

Of course if you have Anaconda you hopefully don't need to do any of this!

Now, to avoid creating two scripts, "sptdf.py" checks whether the file "quotes.csv" exists.
If it does, we use the single-file solution; otherwise we use the multiple-files solution.
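
A minimal sketch of that check (load_single_file/load_multiple_files are hypothetical names used only to show the branch; the real sptdf.py does this inline):

import os

if os.path.isfile("quotes.csv"):
    load_single_file("quotes.csv")        # one file, symbol is a column (hypothetical helper)
else:
    load_multiple_files("*.csv")          # many files, symbol is the filename (hypothetical helper)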

Here are the results when I run the script "sptdf.py" (pure python) on linux:

sptdfcmp1.png

Database load time (3 seconds) is actually pretty good for Python. I had previously tried the read-and-split approach and the plain csv package, but they both took almost 2 minutes. So using pandas.read_csv is definitely the way to go for Python. On a database with a much larger history it's only about 4.5 times slower than the shared lib (an even better ratio).

I simulated fetching symbols with prices by copying each dataframe as it was iterated over, which I think is a fair thing to do.
Although that is 10 times slower here, on a database with a much larger history it's only about 2 times slower, which is excellent.

Calculating a simple moving average for about 2000 securities takes about 2 seconds, which is about 40 times slower than my shared-lib solution. On a database with a much larger history it's about 20 times slower, and that's with only a 5-period simple moving average. This is quite terrible, especially since I use the same Python moving-average function for my shared-lib solution. I probably need to test more complicated indicators on a database with more history to get a better view of performance.

I guess the challenge now is to first code a more efficient simple moving average that takes far less than 2 seconds on the current small database. Can anyone think of a better way to do this? The solution must be coded in a way that the same approach can be applied to more complicated indicators such as ADX or Ichimoku. We don't want to use built-in libraries, because we must be able to code whatever custom indicator we want, hence the need to access individual elements and not whole vectors/arrays.

At the moment my answer is that pandas is too slow and needs fully vectorised operations for good script performance, but hey, it's early days.
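
To make the comparison concrete, here is a sketch of the two styles (the column name and details are assumptions): an element-by-element loop, which is the style that generalises to custom indicators, versus the whole-column vectorised pandas call that the previous paragraph refers to:

import pandas as pd

def sma_loop(closes, n):
    # element-by-element running-total SMA; accesses individual elements like a custom indicator would
    out = [float("nan")] * len(closes)
    total = 0.0
    for i, c in enumerate(closes):
        total += c
        if i >= n:
            total -= closes[i - n]
        if i >= n - 1:
            out[i] = total / n
    return out

def sma_vector(df, n):
    # whole-column (vectorised) SMA; this is what pandas executes quickly
    return df["Close"].rolling(n).mean()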

Cheers,

Andrew.
 

Attachments

  • sptdf20170820.zip
    2.4 KB
I have now updated the script with some much needed changes/additions.

Firstly, the Date column is now the index of the dataframe and dates are sorted from oldest to most recent. If you get an error message like:

ValueError: time data '2017-03-23' does not match format '%d/%m/%Y'

then it means the date format in your data set does not match what the script expects, and you may need to change the following function:

def date_parser(date_string):
    date = pd.to_datetime(date_string, format="%d/%m/%Y") # dd/mm/yyyy
    #date = pd.to_datetime(date_string, format="%d-%m-%Y") # dd-mm-yyyy
    #date = pd.to_datetime(date_string, format="%Y/%m/%d") # yyyy/mm/dd
    #date = pd.to_datetime(date_string, format="%Y-%m-%d") # yyyy-mm-dd
    return date

As you can see, I have it set to dd/mm/yyyy for the purpose of my test data sets posted above.

Alternatively in function call pd.read_csv(...parse_dates=[DATEHEADER], date_parser=date_parser), you can replace entry "date_parser=date_parser" with the generic parser like so:

pd.read_csv(...parse_dates=[DATEHEADER], infer_datetime_format=True)

and this will attempt to cleverly parse the date format no matter what is thrown at it.

The generic parser works with my generated "quotes.csv" file from my Yahoo download thread, where the format is yyyy-mm-dd, but when applied to my test data sets posted above, where the format is dd/mm/yyyy, the generic version treats it as mm/dd/yyyy, which is the US convention. So it always pays to check that the generic parser is working correctly when taking the lazy approach to parsing dates.
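
One possible alternative worth noting (an assumption on my part, not something the current script does, and the "Date" column name is assumed): pandas' read_csv also accepts a dayfirst flag, which tells the generic parser to treat ambiguous dates as dd/mm/yyyy rather than the US mm/dd/yyyy:

import pandas as pd

# dayfirst=True makes the inferred parser read 10/04/2017 as 10 April, not October 4
df = pd.read_csv("anz.csv", parse_dates=["Date"], infer_datetime_format=True, dayfirst=True)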

Secondly, the following common EOD trading TA functions have been added to the script to make things more interesting (and slower):

SMA(df, column_pos, n, sh=0): #Simple Moving Average in n periods shifted/looking sh periods back
EMA(df, column_pos, n, sh=0): #Exponential Moving Average in n periods shifted/looking sh periods back
HHV(df, column_pos, n, sh=0): #Highest High Value in n periods shifted/looking sh periods back (upper band of Donchian channel if sh=1)
LLV(df, column_pos, n, sh=0): #Lowest Low Value in n periods shifted/looking sh periods back (lower band of Donchian channel if sh=1)
CUpDown(df, column_pos, sh): #Number of Consecutive periods up(>0), down(<0) or if breakeven, set to 0.
KER(df, column_pos, n=20): #Kaufman's Efficiency Ratio
MOM(df, column_pos, n): #Momentum
ROC(df, column_pos, n): #Rate of Change
StdDev(df, column_pos, n): #Standard Deviation
BBANDS(df, column_pos, n, sd): #Bollinger Bands
MACD(df, n_fast=12, n_slow=26, n_sign=9): #MACD, MACD Signal and MACD Hist
CCI(df, n=20): #Commodity Channel Index
STO(df, fsn=14, ssn=3): #Stochastic Oscillator
RSI(df, n=14): #Relative Strength Index
TO(df): #Turnover (close x volume)
AvgTO(df,n): #AverageTurnover (average(close) x average(volume))
OBV(df): #On-Balance Volume
AccDist(df): #Accumulation Distribution
ChaOsc(df, n1=3, n2=10): #Chaikin Oscillator (includes Accumulation Distribution)
MFI(df, n=14): #Money Flow Index (aka volume-weighted RSI)
ATR(df, n=14): #Average True Range
ATRW(df, n=14): #Average True Range Welles Wilder
ADX(df, n=14): #Average Directional Movement Index
PSAR(df, afi=0.01, afm=0.2): #Parabolic Sar
Aroon(df, n=25): #Aroon

Note: There are some deprecated versions (prefixed "Depr") of some of the above functions, for old Linux installations or Python versions. You may need to use these if you get an error message like:

"AttributeError: 'Series' object has no attribute 'rolling'" or "AttributeError: 'Series' object has no attribute 'ewm'"

All these TA functions take a dataframe. Calculations are performed on the dataframe's columns, one or more columns containing the results are appended to the dataframe, and the modified dataframe is returned. So a simple moving average of the close adds an additional column like so:

                 Open       High        Low      Close   Volume  SMA(5,0)_3
Date
2012-04-02  23.500000  23.590000  23.049999  23.059999  8970121         NaN
2012-04-03  23.070000  23.209999  22.910000  22.990000  7933211         NaN
2012-04-04  22.799999  23.049999  22.790001  23.010000  7213844         NaN
2012-04-05  22.950001  23.080000  22.770000  23.040001  5952048         NaN
2012-04-10  22.799999  23.059999  22.799999  22.980000  6560878   23.016000
2012-04-11  22.799999  22.850000  22.690001  22.770000  7470594   22.958000
2012-04-12  22.940001  22.980000  22.750000  22.980000  4699164   22.956000
2012-04-13  23.180000  23.200001  22.870001  22.980000  5641305   22.950000

So the last column is a 5 period SMA shifted 0 periods applied to column position 3, which is the close (Open=0,High=1,Low=2,Close=3,Volume=4).
It was necessary to name the column like this in case we want to apply two SMAs, to avoid error messages like:

"ValueError: columns overlap but no suffix specified"

This happens when two or more columns share the same name. It can be very frustrating the first time you see this message.
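
As a hedged sketch of what one of these functions might look like in this style (the actual SMA() in sptdf.py may differ, but it follows the conventions just described: column selected by position, result appended under a unique name):

import pandas as pd

def SMA(df, column_pos, n, sh=0):
    col = df.iloc[:, column_pos]                       # e.g. 3 = Close
    sma = col.rolling(n).mean().shift(sh)              # n period SMA, looking sh periods back
    df["SMA(%d,%d)_%d" % (n, sh, column_pos)] = sma    # unique column name avoids the overlap error
    return df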

You will notice that when the data is read in from pd.read_csv(), I keep all floating point dataframe columns as type "PRICETYPE", where the global variable PRICETYPE is defined at the top of the script as PRICETYPE = np.float64. I have also done this in my TA functions using astype or dtype. So if you want to save memory, you can change PRICETYPE to something like np.float32.

One little problem with setting it to anything smaller than np.float64 is that even though the data read in from pd.read_csv() is set to this type, all pandas calculations still seem to be performed in np.float64. This is annoying, because converting a 32-bit closing price to a 64-bit closing price introduces a very small floating point error that can cause problems when lots of calculations are performed. 32-bit floating point calculations are sufficient for this kind of financial analysis; you shouldn't need to waste space and time on 64-bit calculations. I don't know why pandas does this!
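
A minimal sketch of the PRICETYPE idea using standard pandas/numpy calls (the column names are assumptions and the actual sptdf.py may apply it differently):

import numpy as np
import pandas as pd

PRICETYPE = np.float64      # change to np.float32 to roughly halve the memory used by price columns

df = pd.read_csv("anz.csv",
                 dtype={"Open": PRICETYPE, "High": PRICETYPE,
                        "Low": PRICETYPE, "Close": PRICETYPE})
df["SMA5"] = df["Close"].rolling(5).mean().astype(PRICETYPE)   # force the result back to PRICETYPE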

I have attached file sptdf20180809.zip which contains the updated script file sptdf.py. When executed, it does the following:

1) Iterates through all securities stored in memory, read in with pd.read_csv using the test file data from the posts above.
2) Performs the calculations SMA, ATRW, ADX and Aroon (you may need to put on a cuppa while you wait!).
3) Displays a dataframe of results for ANZ between two dates and also stores the results in test.txt and test.html using df.to_csv() and df.to_html() respectively.

As you can see, the main code isn't much. It's iterating through the dataframes that takes a lot of code.

So what I have covered so far is:

1) Read financial securities data in from either a single CSV file with columns symbol,date,open,high,low,close,volume, or from multiple CSV files where the filename is the security symbol and the columns are date,open,high,low,close,volume.
2) Created a dictionary of dataframes storing all the financial data (read in from the file(s)) in memory.
3) Created common EOD technical analysis functions to perform calculations on a security's dataframe.
4) Iterated through the dataframes doing simple stuff, which could lead to amazing stuff provided we ignore performance LOL!
 

Attachments

  • sptdf20180809.zip
    7.8 KB