Class04 Answer:

Can you transform linr11a.scala into linr11a.py?

I like Python:


"""
~/ml4/public/class04/linr10/linr11a.py

This script should implement the same ML idea as linr11a.scala:

- Read ml4.herokuapp.com/csv/GSPC.csv into DataFrame
- Generate dependent variable: "pctlead"
- Collect independent variables: slp2, ... slp9
- Create train_df, train_a from a filter: 1986-01-01 until 2016-01-01
- Create test_df,  test_a  from a filter: 2016-01-01 until 2017-01-01
- Use LinearRegression to fit model to train_a
- Predict observations in test_a
- Report accuracy and effectiveness of predictions

Demo:
python linr11a.py
"""

import pandas as pd
import numpy  as np
from sklearn import linear_model

# I should Read ml4.herokuapp.com/csv/GSPC.csv into DataFrame
gspc_df = pd.read_csv('https://ml4.herokuapp.com/csv/GSPC.csv')

# I should Generate dependent variable: "pctlead"
leadp_sr   = gspc_df.Close.shift(-1)
pctlead_sr = (100.0*(leadp_sr - gspc_df.Close)/gspc_df.Close).fillna(0)
gspc_df['pctlead'] = pctlead_sr
  
# I should Collect independent variables: slp2, ... slp9
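for slope_i in [2,3,4,5,6,7,8,9]:
  # I should compute the moving average of Close only, and only once;
  # rolling over the whole DataFrame would also touch the non-numeric Date column
  mvgavg_sr      = gspc_df.Close.rolling(window=slope_i).mean()
  col_s          = 'slp'+str(slope_i)
  slope_sr       = 100.0 * (mvgavg_sr - mvgavg_sr.shift(1))/mvgavg_sr
  gspc_df[col_s] = slope_sr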

# I should Create train_df, train_a from a filter: 1986-01-01 until 2016-01-01
filter_train_sr = (gspc_df.Date > '1986') & (gspc_df.Date < '2016')
columns_iwant_l = ['Date','pctlead','slp2','slp3','slp4','slp5','slp6','slp7','slp8','slp9']
train_df        = gspc_df[columns_iwant_l].loc[filter_train_sr]
train_a         = np.array(train_df)

# I should Create test_df,  test_a  from a filter: 2016-01-01 until 2017-01-01
filter_test_sr = (gspc_df.Date > '2016') & (gspc_df.Date < '2017')
test_df        = gspc_df[columns_iwant_l].loc[filter_test_sr]
test_a         = np.array(test_df)

# I should Use LinearRegression to fit model to train_a
linr_model = linear_model.LinearRegression()
x_a = train_a[:,2:] # I should get all rows, columns 2 onward: slp2, ... slp9
y_a = train_a[:,1]  # I should get all rows, column 1: pctlead
linr_model.fit(x_a, y_a)

# I should Predict observations in test_a
x_test_a      = test_a[:,2:]
predictions_a = linr_model.predict(x_test_a)

# I should Report accuracy and effectiveness of predictions
rpt_df               = test_df[['Date','pctlead']].copy()
rpt_df['prediction'] = predictions_a.tolist()
pred_eff_sr          = np.sign(rpt_df.pctlead * rpt_df.prediction)
acc_sr    = (pred_eff_sr    > 0).astype('int')
acc_lo_sr = (rpt_df.pctlead > 0).astype('int')
tp_i = rpt_df.loc[(rpt_df.prediction > 0) & (rpt_df.pctlead > 0)].Date.count()
fp_i = rpt_df.loc[(rpt_df.prediction > 0) & (rpt_df.pctlead < 0)].Date.count()

tn_i = rpt_df.loc[(rpt_df.prediction < 0) & (rpt_df.pctlead < 0)].Date.count()
fn_i = rpt_df.loc[(rpt_df.prediction < 0) & (rpt_df.pctlead > 0)].Date.count()

eff_lo_f = rpt_df.pctlead.sum()

eff_np_f = -rpt_df.loc[rpt_df.prediction < 0].pctlead.sum()
eff_pp_f =  rpt_df.loc[rpt_df.prediction > 0].pctlead.sum()

print('Linear Regression Accuracy:', 100.0*acc_sr.sum()/acc_sr.count(),'%')
print('Long Only Accuracy:', 100.0*acc_lo_sr.sum()/acc_lo_sr.count(),'%')

print('True Positive Count:', tp_i)
print('False Positive Count:',fp_i)

print('True Negative Count:', tn_i)
print('False Negative Count:',fn_i)

print('Effectiveness of Negative Predictions:',eff_np_f)
print('Effectiveness of Positive Predictions:',eff_pp_f)
print('Effectiveness of Long-Only-Model:',     eff_lo_f)

I captured output from the above script:


dan@h80:~/ml4/public/class04/linr10 $ python -i linr11a.py
Linear Regression Accuracy: 50.3968253968 %
Long Only Accuracy: 52.380952381 %
True Positive Count: 94
False Positive Count: 87
True Negative Count: 33
False Negative Count: 38
Effectiveness of Negative Predictions: -1.8008424925215736
Effectiveness of Positive Predictions: 10.542803008862226
Effectiveness of Long-Only-Model: 12.343645501383794
>>> 

The results differ from the results I got from Spark.
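Independent of Spark, the numbers captured above can be checked against each other. The sketch below uses only values copied from that output. It assumes no test row had a prediction or pctlead of exactly zero, so the four counts cover every row:

```python
# Values copied from the captured output of linr11a.py.
tp_i, fp_i, tn_i, fn_i = 94, 87, 33, 38
eff_np_f = -1.8008424925215736
eff_pp_f = 10.542803008862226
eff_lo_f = 12.343645501383794

# Assumption: every test row lands in exactly one of the four counts.
total_i = tp_i + fp_i + tn_i + fn_i

# Accuracy is the share of rows where prediction and pctlead agree in sign.
linr_acc_f = 100.0 * (tp_i + tn_i) / total_i

# Long-only "predicts" positive every day, so it is right when pctlead > 0.
lo_acc_f = 100.0 * (tp_i + fn_i) / total_i

# eff_np_f is the negated sum over negative predictions, so the
# long-only sum should equal eff_pp_f - eff_np_f.
eff_gap_f = abs((eff_pp_f - eff_np_f) - eff_lo_f)

print('Rows in test set:', total_i)
print('Accuracy from counts:', linr_acc_f, '%')
print('Long Only Accuracy from counts:', lo_acc_f, '%')
print('Effectiveness gap:', eff_gap_f)
```

If the two accuracies match the captured report and the gap is close to zero, the report section of the script is at least internally consistent, and any disagreement with Spark is more likely to come from the data, the features, or the model fit.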

I should check both scripts for bugs.
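One place to start is the feature logic. Below is a minimal sketch, assuming a made-up five-row Close series (toy_df is hypothetical, not real GSPC data), small enough that pctlead and slp2 can be verified by hand and compared with what the Scala script gives for the same rows:

```python
import pandas as pd

# Hypothetical toy prices, chosen only so the arithmetic is easy to follow.
toy_df = pd.DataFrame({'Close': [100.0, 102.0, 101.0, 104.0, 104.0]})

# Same pctlead formula as linr11a.py: percent change to the next Close.
leadp_sr = toy_df.Close.shift(-1)
toy_df['pctlead'] = (100.0 * (leadp_sr - toy_df.Close) / toy_df.Close).fillna(0)

# Same slope formula as linr11a.py, for a 2-row window only.
mvgavg_sr = toy_df.Close.rolling(window=2).mean()
toy_df['slp2'] = 100.0 * (mvgavg_sr - mvgavg_sr.shift(1)) / mvgavg_sr

print(toy_df)
```

Two things are worth noticing in the printed frame. The first two slp2 values are NaN, because a 2-row moving average needs two rows and its slope needs one more. The last pctlead is 0.0 only because of fillna(0); there is no next Close for that row, so the zero is a placeholder rather than a real observation.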

Class04 Lab

