Reformatting CSV (Comma Separated Value) files using Perl

ajcode · 5 December 2014

I've been working on a perl script that reformats csv (comma seperated value) files to a format specified by the user. Ever downloaded or used a program to generate csv files that are not quite the format you wanted. Well now you can correct these files using my perl script. Obviously you need perl installed on your computer and hopefully it works with 64 bit installations

So basically what my perl script does is process all csv files (could be only 1) in the current directory where the perl script resides and outputs the resultant csv files in a sub-directory called tmp (if the directory does not exist, it will be created). What the user needs to do is edit the script to either add, delete, modify, and/or rearrange the columns they want written. The limitation is that the original input csv files must be of the same format.

So when the script is run, the following questions will be asked:

1) Output to only 1 file (y/n) ?

If "y" then all results will be stored in a single file called csvref.csv in sub-directory tmp.
Otherwise the resultant files will have the same name as their original counterparts but stored in sub-directory tmp.

2) Write New Header (y/n) ?

If "y" then a header will be written at the start of the resultant file/(s). The user will need to edit subroutine writenewheader() to correct the resultant column headers. Otherwise no header is written to the resultant file/(s).

3) Ignore First N Lines (0 = include all) :->

When processing the input csv file/(s), if the first row contains just headers then you want to skip this row so enter "1", if the second row is garbage aswell then enter "2" and so on. If no headers or garbage then set to 0. So basically N entered rows will be skipped. All input csv files must have the same number of rows to skip over or resultant data will contain misssing or garbage rows which is pretty self explanatory!

I have attached perl script csvref.pl and test input files ANZ.csv, BHP.csv & CBA.csv and will explain the code that user needs to modify in later posts. These files all need to be in the same directory for things to work.

Note: Please remove .txt extension from all files.

Cheers,

Andrew

ajcode · 5 December 2014

So when you download the files (with txt extension removed), put them in the same directory like so:

Here is a sample test run

and the resultant file structure should look like this

Have a play round with the program options and check file contents to get a feel for things, then move on to my next post. Arguments can also be passed in from the command line like so: "perl csvref.pl n y 1". I have also tested this on linux mint 13

Note: If you put garbage in, you will get garbage out. Setting "Ignore First N Lines" to "0" in this case will produce a garbage header.

ajcode · 5 December 2014

So how do we update the script to get the desired output files? The answer lies at the bottom of the code below the comment "# User Modifiable Subroutines" in csvref.pl. There are 3 subroutines that need to be modified which are the following:

1

writenewheader();
2

processline($line, $file1prefix);
3

setdatetimestring($dts);

Before editing these routines, the first thing that needs to be done is to record the column format of the input csv files. Lets use my example files from first post. So in ANZ.csv we have:

<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>

Note: if there is no header then it's still a good idea to record what your looking at.

And say we want to display the result output as

Ticker,Date,Open,High,Low,Close,Volume

Lets go through the code changes in the above order:

1) In subroutine writenewheader(), we need to simply change the print statement to the following:

print FILE2 "Ticker,Date,Open,High,Low,Close,Volume";

FILE2 is just an output file handle for writing the above line.
If you enter "y" to "Write New Header (y/n) ?" when running the script, the above header will be written to the beginning of the resultant output file/(s).

2) processline($line, $fileprefix) takes 2 arguments, $line which is a variable that contains the line (or row) from the input file and $fileprefix which contains the input file name (without the extension) from which the line/row comes from. The only time we need to use the $fileprefix is if we want to store this as a column like in this example.

The $line variable contains all the column values seperated by the delimiter "," and the code in the while statement stores these column values in array @tmp. $tmp[0] contains the first column value, $tmp[1] contains the second column value and so on. So in the above example we have:

$tmp[0] = value of column <Per>
$tmp[1] = value of column <DTYYYYMMDD>
$tmp[2] = value of column <TIME>
$tmp[3] = value of column <OPEN>
$tmp[4] = value of column <HIGH>
$tmp[5] = value of column <LOW>
$tmp[6] = value of column <CLOSE>
$tmp[7] = value of column <VOL>
$tmp[8] = value of column <OPENINT>

Now the columns we want displayed are "Ticker,Date,Open,High,Low,Close,Volume". Since the Ticker column is not found here we need to generate it some how (A special case). Fortunately the Ticker just happens to be the input file name (2nd argument passed in), so we can use variable $fileprefix. All we need to change in the code now is the print statement so the following should do the trick:

print FILE2 "$fileprefix,$tmp[1],$tmp[3],$tmp[4],$tmp[5],$tmp[6],$tmp[7]";

This will display output values like so:

ANZ,20141017,31.84,32.12,31.78,31.93,7450100

Note that the second value here is "20141017" (from $tmp[1]) and when you run the script you get "17/10/2014". This is because I wanted to change the way the date was displayed. The line "$datetimestr = setdatetimestring($tmp[1])" does the change. So variable $datetimestr contains the newly formated date value and subroutine setdatetimestring() is discussed in the next section. Once you are happy with the newly created date column we can update the print statement to:

print FILE2 "$fileprefix,$datetimestr,$tmp[3],$tmp[4],$tmp[5],$tmp[6],$tmp[7]";

3) setdatetimestring() is probably the most difficult code change to make because you have to seperate all the small things that make up the date and/or time and then merge them back together again. This function takes one argument which is the original date/time value from the input file. This argument is stored as $_[0]. As there are many combinations for date format values, I shall only explain the current code so you get an idea of what to do.

So the original date format looks something like "20141017".

The code $yyyy = substr($_[0],0,4) takes the first 4 characters of the original date starting at the first character (position 0) and stores the result in variable $yyyy.

The code $mm = substr($_[0],4,2) takes 2 characters of the original date starting at the 5th character (position 4) and stores the result in variable $mm.

The code $dd = substr($_[0],6,2) takes 2 characters of the original date starting at the 7th character (position 6) and stores the result in variable $dd.

We then merge the variables into one with the following code "$datetimestr = "$dd/$mm/$yyyy" and return this value to the calling subroutine.

Hopefully this helps clear up the code changes people need to make and here ends the Lesson

Andrew

Tito006 · 28 October 2015

Thanks for that, Andrew!

Now I can see decent graphics for UK stock in my MT4 platform

Yahoo's data are just awful, and the only place where I found adequate data (Barchart Trade free trial) provides very confusing and unusable csv files:

But your script allows me to adjust the files so easily!
Now see the difference:

And it was just one example...

Much obliged! :xyxthumbs

ajcode · 29 October 2015

Great work Tito006. Well illustrated diagrams of yahoo and clean data.

Probably win the Stockies award for this

At Aussie, "We'll save you"

Cheers,

Andrew.

Reformatting CSV (Comma Separated Value) files using Perl

ajcode

Attachments

ajcode

ajcode

Tito006

ajcode

Similar threads

Connect with us