Class08 Answer:

https://github.com/danbikle/rollup

Run the scripts in the above repo.

I started by using my browser to get 5 large tar files from the URL listed below:

https://drive.google.com/drive/folders/1jMFSP-_wTtt5rtdtQuQdtPQr6087xz6K

The 5 files are large, so the download might take an hour or two.

The names are listed below:

audusd.tar
eurusd.tar
gbpusd.tar
usdcad.tar
usdjpy.tar

Each tar file contains many zip files.

Each zip file contains many observations (in CSV format).

Each observation is a subsecond sampling of Forex prices.

After you get the above tar files, just do what I did.


cd ~
git clone https://github.com/danbikle/rollup
mkdir -p ~/rollup/data/forex_tarfiles/

At this point I had my 5 tar files in my Downloads folder.

I moved them to the right place:

cd ~/Downloads/
mv audusd.tar ~/rollup/data/forex_tarfiles/
mv eurusd.tar ~/rollup/data/forex_tarfiles/
mv gbpusd.tar ~/rollup/data/forex_tarfiles/
mv usdcad.tar ~/rollup/data/forex_tarfiles/
mv usdjpy.tar ~/rollup/data/forex_tarfiles/

Next, I extracted the zip files from the tar files into a folder called 'zip':

mkdir -p ~/rollup/data/zip
cd       ~/rollup/data/zip
tar xf ~/rollup/data/forex_tarfiles/audusd.tar
tar xf ~/rollup/data/forex_tarfiles/eurusd.tar
tar xf ~/rollup/data/forex_tarfiles/gbpusd.tar
tar xf ~/rollup/data/forex_tarfiles/usdcad.tar
tar xf ~/rollup/data/forex_tarfiles/usdjpy.tar

On your laptop, the above commands might take 20 or 30 minutes.

When I ran the above commands on my laptop, the zip folder had observations going back to 2010.
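If you prefer to stay in Python, the same extraction can be sketched with the standard tarfile module. This is only an alternative I am suggesting, not something from the rollup repo; the function name extract_tarfiles is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_tarfiles(tar_dir, out_dir):
    """Extract every *.tar file found in tar_dir into out_dir.

    Hypothetical helper, not part of the rollup repo.
    """
    tar_dir = Path(tar_dir).expanduser()
    out_dir = Path(out_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)  # same effect as: mkdir -p
    extracted = []
    for tar_path in sorted(tar_dir.glob('*.tar')):
        with tarfile.open(tar_path) as tf:
            tf.extractall(out_dir)  # same effect as: cd out_dir; tar xf tar_path
        extracted.append(tar_path.name)
    return extracted


# Usage, assuming the folder layout described above:
# extract_tarfiles('~/rollup/data/forex_tarfiles', '~/rollup/data/zip')
```

The function returns the names of the tar files it processed, so you can check that all 5 were found.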

I used a few lines of simple Python-Pandas calls to inspect the first zip file:


dan@h79:~/rollup/data $ cd ..
dan@h79:~/rollup $ cd python/
dan@h79:~/rollup/python $ python
Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> my_df = pd.read_csv('../data/zip/AUDUSD-2010-01.zip')
>>> my_df.head()
   AUD/USD  20100103 21:28:00.773  0.89808  0.89832
0  AUD/USD  20100103 21:28:16.897  0.89806  0.89836
1  AUD/USD  20100103 21:30:36.304  0.89770  0.89830
2  AUD/USD  20100103 21:30:36.550  0.89766  0.89804
3  AUD/USD  20100103 21:30:40.813  0.89801  0.89832
4  AUD/USD  20100103 21:31:06.413  0.89803  0.89834
>>>
>>>
>>>

The data has no column headings, so Pandas used the first observation as the header row.

I used the read_csv() parameter named 'names' to supply my own headings:


>>> my_df = pd.read_csv('../data/zip/AUDUSD-2010-01.zip',names=['pair','ts','bid','ask'])
>>> my_df.head()
      pair                     ts      bid      ask
0  AUD/USD  20100103 21:28:00.773  0.89808  0.89832
1  AUD/USD  20100103 21:28:16.897  0.89806  0.89836
2  AUD/USD  20100103 21:30:36.304  0.89770  0.89830
3  AUD/USD  20100103 21:30:36.550  0.89766  0.89804
4  AUD/USD  20100103 21:30:40.813  0.89801  0.89832
>>>
>>>
>>>

You should try the above Python-Pandas calls on your laptop.

If you see something similar to what I display above, then you are on the right track.
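At this point the 'ts' column is still plain text. Below is a minimal sketch of how it could be converted to real timestamps. It uses five rows copied from the output above instead of the zip file, and the format string is my reading of values like '20100103 21:28:00.773', so treat it as an assumption.

```python
import io

import pandas as pd

# Five sample rows copied from the head() output above; the real file is much larger.
sample_csv = """AUD/USD,20100103 21:28:00.773,0.89808,0.89832
AUD/USD,20100103 21:28:16.897,0.89806,0.89836
AUD/USD,20100103 21:30:36.304,0.89770,0.89830
AUD/USD,20100103 21:30:36.550,0.89766,0.89804
AUD/USD,20100103 21:30:40.813,0.89801,0.89832
"""

my_df = pd.read_csv(io.StringIO(sample_csv), names=['pair', 'ts', 'bid', 'ask'])

# Assumed format: YYYYMMDD, a space, then HH:MM:SS with fractional seconds.
my_df['ts'] = pd.to_datetime(my_df['ts'], format='%Y%m%d %H:%M:%S.%f')

# With real timestamps we can measure the time between observations.
my_df['gap'] = my_df['ts'].diff()
print(my_df.dtypes)
print(my_df[['ts', 'gap']])
```

The 'gap' column shows that the observations are irregularly spaced, which is why a roll-up to fixed intervals is useful.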

Next, run the script listed below which should roll-up prices from sub-second observations to one-second observations.


cd ~/rollup/python
~/anaconda3/bin/python rollup1s.py

The script should write the rolled-up prices to this folder:

~/rollup/data/csv1s/

I ran the above script on my laptop and it needed 92 minutes to complete.

If you are in a hurry, you can skip the above script.

Just download the rolled-up prices, csv1s.csv, from the URL listed below:

https://drive.google.com/open?id=19ZMMRnmi1Wm9rubJ_XqexSsagTBSEY0J

I used three shell commands to inspect the output:


dan@h79:~/rollup $ du -sh ~/rollup/data/csv1s # How large is the output?
1.9G	/home/dan/rollup/data/csv1s
dan@h79:~/rollup $ 
dan@h79:~/rollup $


dan@h79:~/rollup $ ls -la ~/rollup/data/csv1s/ | head
total 1963144
drwxrwxr-x 2 dan dan    20480 Nov 12 18:06 .
drwxrwxr-x 7 dan dan     4096 Nov 12 21:53 ..
-rw-rw-r-- 1 dan dan  2871751 Nov 12 16:34 AUDUSD-2010-01.csv.bz2
-rw-rw-r-- 1 dan dan  2348084 Nov 12 16:34 AUDUSD-2010-02.csv.bz2
-rw-rw-r-- 1 dan dan  3167067 Nov 12 16:34 AUDUSD-2010-03.csv.bz2
-rw-rw-r-- 1 dan dan  2814488 Nov 12 16:35 AUDUSD-2010-04.csv.bz2
-rw-rw-r-- 1 dan dan  4727926 Nov 12 16:35 AUDUSD-2010-05.csv.bz2
-rw-rw-r-- 1 dan dan  4113280 Nov 12 16:35 AUDUSD-2010-06.csv.bz2
-rw-rw-r-- 1 dan dan  3817810 Nov 12 16:35 AUDUSD-2010-07.csv.bz2
dan@h79:~/rollup $ 
dan@h79:~/rollup $ 


dan@h79:~/rollup $ ls -la ~/rollup/data/csv1s/ | tail
-rw-rw-r-- 1 dan dan 10035652 Nov 12 18:03 USDJPY-2017-01.csv.bz2
-rw-rw-r-- 1 dan dan  8077095 Nov 12 18:03 USDJPY-2017-02.csv.bz2
-rw-rw-r-- 1 dan dan  9463666 Nov 12 18:04 USDJPY-2017-03.csv.bz2
-rw-rw-r-- 1 dan dan  8050948 Nov 12 18:04 USDJPY-2017-04.csv.bz2
-rw-rw-r-- 1 dan dan  9262043 Nov 12 18:05 USDJPY-2017-05.csv.bz2
-rw-rw-r-- 1 dan dan  9079336 Nov 12 18:05 USDJPY-2017-06.csv.bz2
-rw-rw-r-- 1 dan dan  5526826 Nov 12 18:05 USDJPY-2017-07.csv.bz2
-rw-rw-r-- 1 dan dan  5930854 Nov 12 18:06 USDJPY-2017-08.csv.bz2
-rw-rw-r-- 1 dan dan  5348660 Nov 12 18:06 USDJPY-2017-09.csv.bz2
-rw-rw-r-- 1 dan dan  5250981 Nov 12 18:06 USDJPY-2017-10.csv.bz2
dan@h79:~/rollup $ 
dan@h79:~/rollup $
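To make the idea of a one-second roll-up concrete, here is a minimal sketch on five sample rows from the first zip file. It is not the code in rollup1s.py. I am assuming the roll-up takes the mean of bid and ask within each second; the column name 'ts1s' is hypothetical.

```python
import io

import pandas as pd

# Five sample rows copied from the first zip file shown earlier.
sample_csv = """AUD/USD,20100103 21:28:00.773,0.89808,0.89832
AUD/USD,20100103 21:28:16.897,0.89806,0.89836
AUD/USD,20100103 21:30:36.304,0.89770,0.89830
AUD/USD,20100103 21:30:36.550,0.89766,0.89804
AUD/USD,20100103 21:30:40.813,0.89801,0.89832
"""

df = pd.read_csv(io.StringIO(sample_csv), names=['pair', 'ts', 'bid', 'ask'])
df['ts'] = pd.to_datetime(df['ts'], format='%Y%m%d %H:%M:%S.%f')

# Truncate each timestamp to the start of its second.
df['ts1s'] = df['ts'].dt.floor('s')

# Assumption: the rolled-up price is the mean of the observations in that second.
df1s = df.groupby('ts1s')[['ask', 'bid']].mean().reset_index()
print(df1s)
```

Only the two observations at 21:30:36 share a second, so five input rows become four output rows.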

If you got this far on your laptop, you should keep going.

I consider the files in csv1s to be the first stage of a three-stage rollup.

To start on the second stage, inspect this script:

https://github.com/danbikle/rollup/blob/master/python/rollup1m.py

Next, run the above script with these shell commands:


cd ~/rollup/python
~/anaconda3/bin/python rollup1m.py

I ran the above script and saw some output which is displayed below:


dan@h79:~ $ 
dan@h79:~ $ cd ~/rollup/python
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ ~/anaconda3/bin/python rollup1m.py
Wrote:  ../data/csv1m/AUDUSD-2010-01.csv.bz2
Wrote:  ../data/csv1m/AUDUSD-2010-02.csv.bz2
Wrote:  ../data/csv1m/AUDUSD-2010-03.csv.bz2
Wrote:  ../data/csv1m/AUDUSD-2010-04.csv.bz2

SNIP ...

Wrote:  ../data/csv1m/USDJPY-2017-07.csv.bz2
Wrote:  ../data/csv1m/USDJPY-2017-08.csv.bz2
Wrote:  ../data/csv1m/USDJPY-2017-09.csv.bz2
Wrote:  ../data/csv1m/USDJPY-2017-10.csv.bz2
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 

The above script needed 15 minutes to run on my laptop.

If you are in a hurry, you can skip the above script.

Just download the rolled-up prices, csv1m.csv, from the URL listed below:

https://drive.google.com/open?id=12rFRHWCtqI_taRkyVk-Vw0G70NkZwvAg

The output folder was smaller than the input folder:


dan@h79:~/rollup/python $ cd ~/rollup/data/
dan@h79:~/rollup/data $ 
dan@h79:~/rollup/data $ du -sh csv1s csv1m
1.9G	csv1s
95M	csv1m
dan@h79:~/rollup/data $ 
dan@h79:~/rollup/data $

Using the bzip2 shell command, I inspected the first and last files in the output folder:


dan@h79:~/rollup/data $ bzip2 -cd csv1m/AUDUSD-2010-01.csv.bz2 | head
ts1m,ask,bid
20100103 21:28,0.898340,0.898070
20100103 21:30,0.898245,0.897845
20100103 21:31,0.898340,0.898030
20100103 21:32,0.898275,0.897310
20100103 21:34,0.898290,0.897690
20100103 21:36,0.898737,0.897947
20100103 21:37,0.898855,0.898205
20100103 21:38,0.898670,0.898240
20100103 21:39,0.898710,0.898225
dan@h79:~/rollup/data $
dan@h79:~/rollup/data $


dan@h79:~/rollup/data $ bzip2 -cd csv1m/USDJPY-2017-10.csv.bz2 | head
ts1m,ask,bid
20171001 21:03,112.620443,112.466786
20171001 21:04,112.617667,112.486333
20171001 21:05,112.573250,112.497125
20171001 21:06,112.574225,112.484707
20171001 21:07,112.566472,112.486515
20171001 21:08,112.592050,112.461283
20171001 21:09,112.587962,112.456269
20171001 21:10,112.533788,112.492685
20171001 21:11,112.547400,112.491900
dan@h79:~/rollup/data $ 
dan@h79:~/rollup/data $
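Here is a minimal sketch of a one-minute roll-up, again assuming a mean and not claiming to be what rollup1m.py does. The one-second rows below are hypothetical: I computed them by hand from the five raw AUD/USD rows shown earlier, and the real csv1s files may use a different layout.

```python
import pandas as pd

# Hypothetical one-second rows, hand-computed as per-second means of the
# five raw AUD/USD observations shown earlier in this lab.
df1s = pd.DataFrame({
    'ts1s': pd.to_datetime(['2010-01-03 21:28:00', '2010-01-03 21:28:16',
                            '2010-01-03 21:30:36', '2010-01-03 21:30:40']),
    'ask': [0.89832, 0.89836, 0.89817, 0.89832],
    'bid': [0.89808, 0.89806, 0.89768, 0.89801],
})

# Label each row with its minute, using the same text format as the ts1m column.
df1s['ts1m'] = df1s['ts1s'].dt.strftime('%Y%m%d %H:%M')

# Assumption: the one-minute price is the mean of the one-second prices.
df1m = df1s.groupby('ts1m')[['ask', 'bid']].mean().reset_index()
print(df1m.to_csv(index=False, float_format='%.6f'))
```

On these rows the staged mean agrees with the first two rows of csv1m/AUDUSD-2010-01.csv.bz2 displayed above. That suggests, but does not prove, that the scripts average prices at each stage.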

If you got this far on your laptop, you should keep going.

I consider the files in csv1m to be the second stage of a three-stage rollup.

To start on the third stage, inspect this script:

https://github.com/danbikle/rollup/blob/master/python/rollup5m.py

Next, run the above script with these shell commands:


cd ~/rollup/python
~/anaconda3/bin/python rollup5m.py

I ran the above script and saw some output which is displayed below:


dan@h79:~ $ 
dan@h79:~ $ cd ~/rollup/python
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 


dan@h79:~/rollup/python $ ~/anaconda3/bin/python rollup5m.py
Busy writing csv data to ../data/csv5m
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 


dan@h79:~/rollup/python $ ls -la ../data/csv5m/
total 164928
drwxrwxr-x 2 dan dan    24576 Nov 12 23:14 ./
drwxrwxr-x 7 dan dan     4096 Nov 12 21:53 ../
-rw-rw-r-- 1 dan dan 34238569 Nov 13 16:16 AUDUSD.csv
-rw-rw-r-- 1 dan dan 32808984 Nov 13 16:17 EURUSD.csv
-rw-rw-r-- 1 dan dan 32813856 Nov 13 16:17 GBPUSD.csv
-rw-rw-r-- 1 dan dan 32598386 Nov 13 16:18 USDCAD.csv
-rw-rw-r-- 1 dan dan 36373896 Nov 13 16:18 USDJPY.csv
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 

Notice that rollup5m.py had a duration of about 4 minutes.

Also notice that the output files are not compressed:


dan@h79:~/rollup/python $ du -sh ../data/csv5m/*csv
33M	../data/csv5m/AUDUSD.csv
32M	../data/csv5m/EURUSD.csv
32M	../data/csv5m/GBPUSD.csv
32M	../data/csv5m/USDCAD.csv
35M	../data/csv5m/USDJPY.csv
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $ 


dan@h79:~/rollup/python $ head ../data/csv5m/*csv
==> ../data/csv5m/AUDUSD.csv <==
ts5m,ask,bid
1262582700,0.898340,0.898070
1262583000,0.898287,0.897719
1262583300,0.898743,0.898154
1262583600,0.898810,0.898414
1262583900,0.898893,0.898490
1262584200,0.898905,0.898445
1262584500,0.898817,0.898367
1262584800,0.898789,0.898437
1262585100,0.898852,0.898521

==> ../data/csv5m/EURUSD.csv <==
ts5m,ask,bid
1262582700,1.431004,1.430730
1262583000,1.430770,1.430507
1262583300,1.430667,1.430382
1262583600,1.430913,1.430610
1262583900,1.430832,1.430522
1262584200,1.430533,1.430124
1262584500,1.430241,1.429957
1262584800,1.430351,1.430098
1262585100,1.430225,1.429979

==> ../data/csv5m/GBPUSD.csv <==
ts5m,ask,bid
1262582700,1.613610,1.612887
1262583000,1.612035,1.611283
1262583300,1.611810,1.610866
1262583600,1.611708,1.611048
1262583900,1.611226,1.610816
1262584200,1.611391,1.610671
1262584500,1.611557,1.610947
1262584800,1.611461,1.610957
1262585100,1.611229,1.610783

==> ../data/csv5m/USDCAD.csv <==
ts5m,ask,bid
1262583000,1.052640,1.051770
1262583600,1.052650,1.051780
1262583900,1.052625,1.051763
1262584800,1.052540,1.051710
1262585100,1.052320,1.051670
1262585400,1.051597,1.051092
1262585700,1.051200,1.050860
1262586000,1.050981,1.050466
1262586300,1.050900,1.050248

==> ../data/csv5m/USDJPY.csv <==
ts5m,ask,bid
1262582700,93.002000,92.979500
1262583000,93.011806,92.979472
1262583300,93.001834,92.970416
1262583600,93.008867,92.967100
1262583900,93.000000,92.967000
1262584200,92.997900,92.963178
1262584500,93.015885,92.984903
1262584800,93.001022,92.977433
1262585100,93.057137,93.033794
dan@h79:~/rollup/python $ 
dan@h79:~/rollup/python $
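The third stage can be sketched the same way. The code below is not rollup5m.py. It takes the nine AUDUSD one-minute rows displayed earlier, floors each timestamp to a five-minute boundary, converts it to epoch seconds, and takes the mean. I am assuming the timestamps are UTC, which is a guess on my part.

```python
import io

import pandas as pd

# One-minute rows copied from csv1m/AUDUSD-2010-01.csv.bz2 as displayed earlier.
sample_1m = """ts1m,ask,bid
20100103 21:28,0.898340,0.898070
20100103 21:30,0.898245,0.897845
20100103 21:31,0.898340,0.898030
20100103 21:32,0.898275,0.897310
20100103 21:34,0.898290,0.897690
20100103 21:36,0.898737,0.897947
20100103 21:37,0.898855,0.898205
20100103 21:38,0.898670,0.898240
20100103 21:39,0.898710,0.898225
"""

df1m = pd.read_csv(io.StringIO(sample_1m))
ts = pd.to_datetime(df1m['ts1m'], format='%Y%m%d %H:%M')

# Floor to the start of each five-minute interval.
floored = ts.dt.floor('5min')

# Convert to whole seconds since 1970-01-01 (assumption: timestamps are UTC).
df1m['ts5m'] = (floored - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# Assumption: the five-minute price is the mean of the one-minute prices.
df5m = df1m.groupby('ts5m')[['ask', 'bid']].mean().reset_index()
print(df5m.to_csv(index=False, float_format='%.6f'))
```

The ask and bid values from this sketch closely match the first three rows of AUDUSD.csv above. The ts5m values come out 28800 seconds (8 hours) lower than the ones in the file, which would be consistent with the script reading the timestamps in a UTC-8 local time zone, but I have not confirmed that.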

So, that completes this lab which shows how to 'roll-up' sub-second Forex data into five-minute observations of prices.
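Before using these files, it may help to check them for missing intervals. The sketch below uses the first four USDCAD rows displayed above; the column names 'dt', 'gap_s' and 'missing_bars' are hypothetical names I picked.

```python
import io

import pandas as pd

# First four rows copied from ../data/csv5m/USDCAD.csv as displayed above.
sample_5m = """ts5m,ask,bid
1262583000,1.052640,1.051770
1262583600,1.052650,1.051780
1262583900,1.052625,1.051763
1262584800,1.052540,1.051710
"""

df5m = pd.read_csv(io.StringIO(sample_5m))

# Interpret ts5m as seconds since 1970-01-01 UTC.
df5m['dt'] = pd.to_datetime(df5m['ts5m'], unit='s')

# Seconds between consecutive rows; 300 means no five-minute bar is missing.
df5m['gap_s'] = df5m['ts5m'].diff()
df5m['missing_bars'] = df5m['gap_s'] / 300 - 1
print(df5m[['dt', 'gap_s', 'missing_bars']])
```

In this small sample USDCAD skips one five-minute bar and later two more, so a model that assumes evenly spaced rows would need the gaps filled or handled first.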

These five-minute observations of prices are well suited for building machine learning models which predict future prices at these durations:

Class08 Lab (Pandas Staged Rollup)