Class09 Answer:

Generate Models/Predictions from Features

When I started this lab I wanted to build Logistic Regression models from all available observations.

After experimentation, I found that this idea was not feasible; it took too long.

So I wrote the script below, which builds models for a random sample of observations:


# learn_tst_rpt_random.py

# This script should learn from these files:
# ../csv/featAUDUSD.csv
# ../csv/featEURUSD.csv
# ../csv/featGBPUSD.csv
# ../csv/featUSDCAD.csv
# ../csv/featUSDJPY.csv
# Then it should generate predictions and write to these files:
# ../csv/predictions_AUDUSD...
# ../csv/predictions_EURUSD...
# ../csv/predictions_GBPUSD...
# ../csv/predictions_USDCAD...
# ../csv/predictions_USDJPY...

import pandas as pd
import numpy  as np
from sklearn import linear_model
import os,glob,random,datetime,pdb

# I should remove unneeded files:
cmd1_s = "rm -f ../csv/predictions_*.csv"
os.system(cmd1_s)

# I should use a nested loop to generate predictions from a window which slides over several pairs.
# Sometimes I call this window the test-window because it should contain test-data.

# I should describe a window which slides from bottom of DF to top.
# The window should make jumps rather than sliding one row at a time.
# The window width should be same width as DF.
# The window length should be wlen_i.
# jump size should be jump_i.
# The number of jumps should be jumpc_i.

wlen_i = 20
jump_i = wlen_i # Avoids prediction 'overlap'
for trainsize_i in range(9000, 109000, 2000):
  print('Busy ooooooooooooooooooooooooooo')
  # I should define the number of observations I hold a pair after I buy/sell it.
  # Observations are separated by 5 min. One hour is 12 observations:
  duration_i = 12 # Hold for 1 hour then act on next prediction.
  pairs_l    = ['AUDUSD','EURUSD','GBPUSD','USDCAD','USDJPY']
  for pair_s in pairs_l:
    p0_df      = pd.read_csv("../csv/feat"+pair_s+".csv")
    # I should control how many times I jump the window.
    # If each jump is small, I can make more jumps:
    ### jumpc_i    = int((len(p0_df)-trainsize_i-100) / jump_i)-1
    # Above expression keeps my jumps inside of p0_df.
    # Below I hard-code jumpc_i to integer which works well with random sample idea:
    jumpc_i = 300
    for cnt_i in range(jumpc_i,0,-1):
      # I should build a model here from a random sample of data rather than all data:
      test_end_i    = random.randrange(trainsize_i+10*duration_i, (len(p0_df)-duration_i))
      test_start_i  = test_end_i-wlen_i
      train_end_i   = test_start_i - 2*duration_i # Avoid overlap
      train_start_i = train_end_i - trainsize_i
      train_df      = p0_df[train_start_i:train_end_i]
      train_df.to_csv('/tmp/train_df.csv', float_format='%4.4f')
      test_df       = p0_df[test_start_i:test_end_i]
      logr_model    = linear_model.LogisticRegression()
      xtrain_a      = np.array(train_df)[:,3:]
      xtest_a       = np.array(test_df)[:,3:]
      ytrain_sr     = train_df.piplead
      class_train_a = (ytrain_sr > 0.0)
      logr_model.fit(xtrain_a, class_train_a)
      # I should predict
      predictions_df = test_df.copy()[['ts','cp','piplead']]
      predictions_df['prediction'] = logr_model.predict_proba(xtest_a)[:,1].tolist()
      predictions_df['eff'] = np.sign(predictions_df.prediction - 0.5) * predictions_df.piplead
      predictions_df['acc'] = (predictions_df.eff > 0)
      fn_s = "../csv/predictions_"+pair_s+str(1000+cnt_i)+".csv" 
      predictions_df.to_csv(fn_s, float_format='%4.4f', index=False)
  eff_sum_f = 0.0
  for pair_s in pairs_l:
    fn_l = glob.glob("../csv/predictions_"+pair_s+"*.csv")
    # For this pair I should sort, deduplicate, and write the results to a single file.
    # The grep keeps only lines containing '0', which filters out the repeated CSV header lines.
    # inspiration:
    # sort -u ../csv/predictions_AUDUSD*.csv|grep 0 > ../csv/predictionsAUDUSD.csv
    if len(fn_l) > 0 :
      cmd0_s = "sort -u ../csv/predictions_"+pair_s+"*.csv|grep 0 > "
      fn_s   = "../csv/predictions"+str(trainsize_i)+pair_s+".csv"
      os.system(cmd0_s + fn_s)
      p0_df = pd.read_csv(fn_s,names=['ts','cp','piplead','prediction','eff','acc'])
      print(pair_s+" Effectiveness:")
      eff_pair = np.sum(p0_df.eff)
      print(eff_pair)
      eff_sum_f = eff_sum_f + eff_pair
      print(pair_s+" Accuracy:")
      print(str(100 * np.sum(p0_df.acc) / len(p0_df.acc))+' %')
  print('trainsize_i:')
  print(trainsize_i)
  print('eff_sum_f:')
  print(eff_sum_f)
  # I should remove unneeded files:
  cmd1_s = "rm -f ../csv/predictions_*.csv"
  os.system(cmd1_s)

'bye'
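The core modeling step in the script above can be illustrated in isolation. The sketch below uses synthetic data (the feature names and values are made up; only the column names 'piplead', 'prediction', 'eff', and 'acc' mirror the script) to show how the classification target, probability predictions, and the eff/acc columns fit together:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

rng = np.random.default_rng(0)

# Synthetic stand-in for one pair's feature DataFrame.
# 'piplead' is the future pip change; feat0/feat1 are hypothetical features.
df = pd.DataFrame({
    'piplead': rng.normal(0, 10, 200),
    'feat0':   rng.normal(0, 1, 200),
    'feat1':   rng.normal(0, 1, 200),
})

train_df = df[:180]
test_df  = df[180:]

model = linear_model.LogisticRegression()
# Classify: will piplead be positive?
model.fit(train_df[['feat0', 'feat1']], train_df.piplead > 0.0)

pred = test_df.copy()
# Probability of the True class (piplead > 0):
pred['prediction'] = model.predict_proba(test_df[['feat0', 'feat1']])[:, 1]
# eff: pips gained if we buy when prediction > 0.5, sell otherwise
pred['eff'] = np.sign(pred.prediction - 0.5) * pred.piplead
pred['acc'] = pred.eff > 0
print(pred[['piplead', 'prediction', 'eff', 'acc']].head())
```

Note that predict_proba returns one column per class; column 1 corresponds to the True class because sklearn sorts classes (False before True).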

In November 2017, I ran the above script; it took many hours to run.

I collected the predictions from that script into a tar file you can download:

learn4.us/class09predictions.tar.bz2

I wrote a script to report on the contents of the above tar file:


"""
rpt09.py

This script should report on data in this download:

http://learn4.us/class09predictions.tar.bz2

Demo:
~/anaconda3/bin/python rpt09.py
"""

import glob
import os
import pandas as pd
import pdb
import re

# I should get the download:
shellcmd_s = '/usr/bin/curl -L learn4.us/class09predictions.tar.bz2 > /tmp/class09predictions.tar.bz2'
print('Busy with this:')
print(shellcmd_s)
os.system(shellcmd_s)

# I should untar it:
shellcmd_s = 'cd /tmp/; tar jxf class09predictions.tar.bz2'
print('Busy with this:')
print(shellcmd_s)
os.system(shellcmd_s)


# I should use glob.glob to create a list of file-names.
fn_l = glob.glob('/tmp/csv/predictions*USD*.csv')
# I should use that list to drive a loop:
for fn_s in sorted(fn_l):
    # I should use a RegExp to extract the integer and pair-name from fn_s
    pattern_re     = r'(predictions)(\d+)(.+)(\.csv)'
    pattern_ma     = re.search(pattern_re, fn_s)
    train_i_s      = pattern_ma[2]
    pair_s         = pattern_ma[3]
    predictions_df = pd.read_csv(fn_s,names=['ts','cp','piplead','prediction','eff','acc'])
    effsum_f       = predictions_df.eff.sum()
    print('For this pair: ', pair_s, ' and this training count: ', train_i_s)
    print('Effectiveness sum is: ',effsum_f)
'bye'
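The regular expression above can be checked against a sample file name. The name below is made up for illustration (the dot before csv is escaped so it matches literally); note that indexing a match object with pattern_ma[2] requires Python 3.6 or later:

```python
import re

# Hypothetical file name in the format written by learn_tst_rpt_random.py:
fn_s = '/tmp/csv/predictions107000AUDUSD.csv'
pattern_re = r'(predictions)(\d+)(.+)(\.csv)'
pattern_ma = re.search(pattern_re, fn_s)
# Group 2 is the training count; group 3 is the pair name.
print(pattern_ma[2])  # 107000
print(pattern_ma[3])  # AUDUSD
```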

I ran the above script and collected a screenshot:


dan@h79:~/ml4/public/class09 $ ~/anaconda3/bin/python rpt09.py
Busy with this:
/usr/bin/curl -L learn4.us/class09predictions.tar.bz2 > /tmp/class09predictions.tar.bz2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 9251k  100 9251k    0     0  4503k      0  0:00:02  0:00:02 --:--:-- 6688k
Busy with this:
cd /tmp/; tar jxf class09predictions.tar.bz2
For this pair:  AUDUSD  and this training count:  101000
Effectiveness sum is:  -2876.6973999999905
For this pair:  EURUSD  and this training count:  101000
Effectiveness sum is:  4621.516400000013
For this pair:  GBPUSD  and this training count:  101000
Effectiveness sum is:  -7196.738100000065
For this pair:  USDCAD  and this training count:  101000
Effectiveness sum is:  1272.2763000000045
For this pair:  USDJPY  and this training count:  101000
Effectiveness sum is:  -3691.3764000000233
For this pair:  AUDUSD  and this training count:  103000
Effectiveness sum is:  -278.6045999999975
For this pair:  EURUSD  and this training count:  103000
Effectiveness sum is:  2991.208500000025
For this pair:  GBPUSD  and this training count:  103000
Effectiveness sum is:  593.0258999999729
For this pair:  USDCAD  and this training count:  103000
Effectiveness sum is:  5271.021599999977
For this pair:  USDJPY  and this training count:  103000
Effectiveness sum is:  -2922.7771000000125
For this pair:  AUDUSD  and this training count:  105000
Effectiveness sum is:  3929.921999999983
For this pair:  EURUSD  and this training count:  105000
Effectiveness sum is:  4155.316699999995
For this pair:  GBPUSD  and this training count:  105000
Effectiveness sum is:  -374.08340000001385
For this pair:  USDCAD  and this training count:  105000
Effectiveness sum is:  1737.2925999999743
For this pair:  USDJPY  and this training count:  105000
Effectiveness sum is:  -2638.909300000007
For this pair:  AUDUSD  and this training count:  107000
Effectiveness sum is:  8038.491699999903
For this pair:  EURUSD  and this training count:  107000
Effectiveness sum is:  365.45629999999056
For this pair:  GBPUSD  and this training count:  107000
Effectiveness sum is:  3691.8400999999226
For this pair:  USDCAD  and this training count:  107000
Effectiveness sum is:  -2203.800900000003
For this pair:  USDJPY  and this training count:  107000
Effectiveness sum is:  300.0485000000083
dan@h79:~/ml4/public/class09 $ 

I see that the most predictable combination of currency-pair and observation-training-count is AUDUSD and 107000 which gives an effectiveness sum of 8038.5 pips.
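The best combination can also be picked programmatically rather than by eye. A minimal sketch, assuming the (pair, training-count, effectiveness) triples have been collected into a list; the rows below are a hand-copied subset of the report output above:

```python
# (pair, training count, effectiveness sum) -- a few rows from the report:
results_l = [
    ('AUDUSD', 107000,  8038.49),
    ('GBPUSD', 107000,  3691.84),
    ('USDCAD', 103000,  5271.02),
    ('EURUSD', 101000,  4621.52),
]
# Pick the row with the largest effectiveness sum:
best = max(results_l, key=lambda row: row[2])
print(best)  # ('AUDUSD', 107000, 8038.49)
```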


Class09 Lab

