I like Python:
"""
~/ml4/public/class04/linr10/linr11a.py
This script should implement the same ML idea as linr11a.scala:
- Read ml4.herokuapp.com/csv/GSPC.csv into DataFrame
- Generate dependent variable: "pctlead"
- Collect independent variables: slp2, ... slp9
- Create train_df, train_a from a filter: 1986-01-01 until 2016-01-01
- Create test_df, test_a from a filter: 2016-01-01 until 2017-01-01
- Use LinearRegression to fit model to train_a
- Predict observations in test_a
- Report accuracy and effectiveness of predictions
Demo:
python linr11a.py
"""
import pandas as pd
import numpy as np
from sklearn import linear_model
# I should Read ml4.herokuapp.com/csv/GSPC.csv into DataFrame
gspc_df = pd.read_csv('https://ml4.herokuapp.com/csv/GSPC.csv')
# I should Generate dependent variable: "pctlead"
leadp_sr = gspc_df.Close.shift(-1)
pctlead_sr = (100.0*(leadp_sr - gspc_df.Close)/gspc_df.Close).fillna(0)
gspc_df['pctlead'] = pctlead_sr
# I should Collect independent variables: slp2, ... slp9
for slope_i in [2,3,4,5,6,7,8,9]:
rollx = gspc_df.rolling(window=slope_i)
col_s = 'slp'+str(slope_i)
slope_sr = 100.0 * (rollx.mean().Close - rollx.mean().Close.shift(1))/rollx.mean().Close
gspc_df[col_s] = slope_sr
# I should Create train_df, train_a from a filter: 1986-01-01 until 2016-01-01
filter_train_sr = (gspc_df.Date > '1986') & (gspc_df.Date < '2016')
columns_iwant_l = ['Date','pctlead','slp2','slp3','slp4','slp5','slp6','slp7','slp8','slp9']
train_df = gspc_df[columns_iwant_l].loc[filter_train_sr]
train_a = np.array(train_df)
# I should Create test_df, test_a from a filter: 2016-01-01 until 2017-01-01
filter_test_sr = (gspc_df.Date > '2016') & (gspc_df.Date < '2017')
test_df = gspc_df[columns_iwant_l].loc[filter_test_sr]
test_a = np.array(test_df)
# I should Use LinearRegression to fit model to train_a
linr_model = linear_model.LinearRegression()
x_a = train_a[:,2:] # I should get all rows, columns after column-0,1
y_a = train_a[:,1] # I should get all rows, column after column-0
linr_model.fit(x_a, y_a)
# I should Predict observations in test_a
x_test_a = test_a[:,2:]
predictions_a = linr_model.predict(x_test_a)
# I should Report accuracy and effectiveness of predictions
rpt_df = test_df[['Date','pctlead']].copy()
rpt_df['prediction'] = predictions_a.tolist()
pred_eff_sr = np.sign(rpt_df.pctlead * rpt_df.prediction)
acc_sr = (pred_eff_sr > 0).astype('int')
acc_lo_sr = (rpt_df.pctlead > 0).astype('int')
tp_i = rpt_df.loc[(rpt_df.prediction > 0) & (rpt_df.pctlead > 0)].Date.count()
fp_i = rpt_df.loc[(rpt_df.prediction > 0) & (rpt_df.pctlead < 0)].Date.count()
tn_i = rpt_df.loc[(rpt_df.prediction < 0) & (rpt_df.pctlead < 0)].Date.count()
fn_i = rpt_df.loc[(rpt_df.prediction < 0) & (rpt_df.pctlead > 0)].Date.count()
eff_lo_f = rpt_df.pctlead.sum()
eff_np_f = -rpt_df.loc[rpt_df.prediction < 0].pctlead.sum()
eff_pp_f = rpt_df.loc[rpt_df.prediction > 0].pctlead.sum()
print('Linear Regression Accuracy:', 100.0*acc_sr.sum()/acc_sr.count(),'%')
print('Long Only Accuracy:', 100.0*acc_lo_sr.sum()/acc_lo_sr.count(),'%')
print('True Positive Count:', tp_i)
print('False Positive Count:',fp_i)
print('True Negative Count:', tn_i)
print('False Negative Count:',fn_i)
print('Effectiveness of Negative Predictions:',eff_np_f)
print('Effectiveness of Positive Predictions:',eff_pp_f)
print('Effectiveness of Long-Only-Model:', eff_lo_f)
I captured output from the above script:
dan@h80:~/ml4/public/class04/linr10 $ python -i linr11a.py
Linear Regression Accuracy: 50.3968253968 %
Long Only Accuracy: 52.380952381 %
True Positive Count: 94
False Positive Count: 87
True Negative Count: 33
False Negative Count: 38
Effectiveness of Negative Predictions: -1.8008424925215736
Effectiveness of Positive Predictions: 10.542803008862226
Effectiveness of Long-Only-Model: 12.343645501383794
>>>
dan@h80:~/ml4/public/class04/linr10 $
dan@h80:~/ml4/public/class04/linr10 $
The results are different than the results from Spark.
I should check both scripts for bugs.