Class03 Answer:

Write NumPy syntax which does something similar to the Pandas syntax below:


"""
pct_laglead.py

This script should compute columns pctlag1 and pctlead from closep.
"""

import pandas as pd

prices_df = pd.read_csv('http://ml4.us/csv/GSPC.csv')
prices_df.columns = ['cdate_s','openp','highp','lowp','closep','adjp','volume']

# I should get 2016 July and two columns:
pred_sr = (prices_df.cdate_s > '2016-07') & (prices_df.cdate_s < '2016-08')
s1_df   = prices_df[pred_sr][['cdate_s','closep']]

# I should compute pctlead and pctlag1 from closep with help from shift():
pctlead = 100 * (s1_df.closep.shift(-1) - s1_df.closep) / s1_df.closep
pctlag1 = 100 * (s1_df.closep - s1_df.closep.shift(1)) / s1_df.closep.shift(1)

s1_df['pctlag1'] = pctlag1
s1_df['pctlead'] = pctlead

# I should visualize:
print(s1_df)

'bye'

I ran the above script and saw this:


dan@a78:~/ml4/public/class03demos $ python pct_laglead.py
          cdate_s       closep   pctlag1   pctlead
16732  2016-07-01  2102.949951       NaN -0.684748
16733  2016-07-05  2088.550049 -0.684748  0.535296
16734  2016-07-06  2099.729980  0.535296 -0.087158
16735  2016-07-07  2097.899902 -0.087158  1.525335
16736  2016-07-08  2129.899902  1.525335  0.340862
16737  2016-07-11  2137.159912  0.340862  0.700929
16738  2016-07-12  2152.139893  0.700929  0.013477
16739  2016-07-13  2152.429932  0.013477  0.525920
16740  2016-07-14  2163.750000  0.525920 -0.092895
16741  2016-07-15  2161.739990 -0.092895  0.238230
16742  2016-07-18  2166.889893  0.238230 -0.143517
16743  2016-07-19  2163.780029 -0.143517  0.427030
16744  2016-07-20  2173.020020  0.427030 -0.361253
16745  2016-07-21  2165.169922 -0.361253  0.455396
16746  2016-07-22  2175.030029  0.455396 -0.301148
16747  2016-07-25  2168.479980 -0.301148  0.032278
16748  2016-07-26  2169.179932  0.032278 -0.119854
16749  2016-07-27  2166.580078 -0.119854  0.160621
16750  2016-07-28  2170.060059  0.160621  0.163131
16751  2016-07-29  2173.600098  0.163131       NaN
dan@a78:~/ml4/public/class03demos $ 

I can see that the above script creates a column named pctlag1 from closep.

Each time closep is observed, pctlag1 tells me how much closep has changed since the previous observation.

Also the script creates a similar column named pctlead.

I have two ways to understand pctlead.

The easiest way is to see pctlead as a shifted version of pctlag1: today's pctlag1 is yesterday's pctlead.

Another way is to see pctlead as a one-day look into the future, for each day in the past.

Today, however, pctlead is unknown because closep for tomorrow is unknown.

So the above script uses Pandas shift() to help me compute pctlag1 and pctlead.
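
If I want to convince myself that today's pctlag1 is yesterday's pctlead, I can run a small sketch like the one below. It is not part of the class scripts, and the tiny Series of closing prices is made up:


import pandas as pd

# Tiny hand-made closing prices, just to watch shift() in action:
closep  = pd.Series([100.0, 102.0, 101.0, 103.0])
pctlead = 100 * (closep.shift(-1) - closep) / closep
pctlag1 = 100 * (closep - closep.shift(1)) / closep.shift(1)

# Shifting pctlead down one row should reproduce pctlag1,
# which is the "today's pctlag1 is yesterday's pctlead" idea:
print(pctlag1.equals(pctlead.shift(1)))   # should print: True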

A way to compute pctlag1 and pctlead using NumPy is displayed below:


"""
class03np18.py

This script should compute columns pctlag1 and pctlead from closep.
"""

import pandas as pd
import numpy  as np

prices_df = pd.read_csv('http://ml4.us/csv/GSPC.csv')
prices_df.columns = ['cdate_s','openp','highp','lowp','closep','adjp','volume']

# I should get 2016 July and two columns.
# I should do it the numpy way:
prices_a = np.array(prices_df)

# I should build a predicate: True on rows where column-0 > '2016-07'
pred1_a  = (prices_a[:,0] > '2016-07')
# I should build a predicate: True on rows where column-0 < '2016-08'
pred2_a  = (prices_a[:,0] < '2016-08')

# I should slice out July with the two predicates:
july_a   = prices_a[pred1_a & pred2_a]

# I should get all rows and get columns 0 and 4:
s1_a      = july_a[:,[0,4]]
# I should create a column of closing prices:
cp_a      = s1_a[:,[1]]
# 0th closing price:
elem0     = cp_a[:1]
# last cp:
elem_last = cp_a[-1:]
# duplicate elem_last at the end:
lead_a    = np.vstack((cp_a , elem_last))
# duplicate elem0 at the start:
lag1_a    = np.vstack((elem0, cp_a     ))
# Easy calculations:
pctlag1   = 100 * ( cp_a - lag1_a[:-1] ) / lag1_a[:-1] 
pctlead   = 100 * ( lead_a[1:] - cp_a) / cp_a
# I should do what the Pandas script did:
s1_a      = np.hstack((s1_a,pctlag1))
s1_a      = np.hstack((s1_a,pctlead))

# I should visualize:
print(s1_a)

'bye'

I ran it and saw this:


dan@a78:~/ml4/public/class03demos $ python class03np18.py
[['2016-07-01' 2102.949951 0.0 -0.6847477275031026]
 ['2016-07-05' 2088.550049 -0.6847477275031026 0.5352962934909359]
 ['2016-07-06' 2099.72998 0.5352962934909359 -0.08715777825870533]
 ['2016-07-07' 2097.899902 -0.08715777825870533 1.5253349299217422]
 ['2016-07-08' 2129.899902 1.5253349299217422 0.34086155847900335]
 ['2016-07-11' 2137.159912 0.34086155847900335 0.7009293462734553]
 ['2016-07-12' 2152.139893 0.7009293462734553 0.013476772627251273]
 ['2016-07-13' 2152.429932 0.013476772627251273 0.5259203949780415]
 ['2016-07-14' 2163.75 0.5259203949780415 -0.09289474292316421]
 ['2016-07-15' 2161.73999 -0.09289474292316421 0.23822952916738127]
 ['2016-07-18' 2166.889893 0.23822952916738127 -0.14351739837110478]
 ['2016-07-19' 2163.780029 -0.14351739837110478 0.4270300527854589]
 ['2016-07-20' 2173.02002 0.4270300527854589 -0.3612529073708161]
 ['2016-07-21' 2165.169922 -0.3612529073708161 0.45539645178943006]
 ['2016-07-22' 2175.030029 0.45539645178943006 -0.30114752038671483]
 ['2016-07-25' 2168.47998 -0.30114752038671483 0.03227846263076587]
 ['2016-07-26' 2169.179932 0.03227846263076587 -0.11985423438815265]
 ['2016-07-27' 2166.580078 -0.11985423438815265 0.16062092674702202]
 ['2016-07-28' 2170.060059 0.16062092674702202 0.16313092282023237]
 ['2016-07-29' 2173.600098 0.16313092282023237 0.0]]
dan@a78:~/ml4/public/class03demos $ 

That report looks different from the Pandas report (which has an easier-to-read format), but the values in each report do match.
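
If I want to check that match with syntax rather than with my eyes, I can run a small sketch like the one below. It assumes s1_df from pct_laglead.py and s1_a from class03np18.py are both in memory, and it skips the two endpoint rows, which the scripts handle differently:


import numpy as np

# Pull pctlag1 and pctlead out of both results, skipping the first and
# last rows where Pandas holds NaN and the NumPy script holds 0.0:
pandas_vals = s1_df[['pctlag1','pctlead']].values[1:-1]
numpy_vals  = s1_a[:,[2,3]].astype(float)[1:-1]
print(np.allclose(pandas_vals, numpy_vals))   # should print: True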

Notice that Pandas handles the data endpoints differently.

For example, the 2016-07-29 row of the Pandas report lists NaN as the value for pctlead, because shift(-1) leaves the last row with no next value.

In the NumPy report the same row lists 0.0 for pctlead, because class03np18.py duplicates the last closing price before computing pctlead. (The 2016-07-01 row shows the same difference for pctlag1.)
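
If I want the NumPy report to show NaN at the endpoints the way Pandas does, one option (my own variation, not part of class03np18.py) is to pad with np.nan instead of duplicating elem0 and elem_last:


import numpy as np

# Variation: pad with NaN instead of duplicating the endpoint closing
# prices, so the endpoint rows compute to NaN like Pandas shift() does.
# This assumes cp_a is the column of closing prices built earlier.
nan_row = np.array([[np.nan]])
lead_a  = np.vstack((cp_a   , nan_row))  # tomorrow's close, NaN at the end
lag1_a  = np.vstack((nan_row, cp_a   ))  # yesterday's close, NaN at the start
pctlag1 = 100 * (cp_a - lag1_a[:-1]) / lag1_a[:-1]
pctlead = 100 * (lead_a[1:] - cp_a) / cp_a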

Another obvious difference between the two reports is that Pandas labels the columns and NumPy does not.
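
If I want labeled columns on the NumPy report, I can hand the array back to Pandas. A short sketch (again assuming s1_a from class03np18.py is in memory) is below:


import pandas as pd

# Wrap the 4-column NumPy array in a DataFrame so the report gets labels:
report_df = pd.DataFrame(s1_a, columns=['cdate_s','closep','pctlag1','pctlead'])
print(report_df)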

Class03 Lab

