Class03 Answer:

Write Pandas syntax which mimics this SQL:


CREATE TABLE prices5 AS 
SELECT
cdate
,closep
,AVG(closep)OVER(ORDER BY cdate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) mvgavg5day
FROM prices
WHERE cdate > '2016-08-01'
ORDER BY cdate;

SELECT * FROM prices5 WHERE cdate BETWEEN '2016-08-01' AND '2016-08-31' ORDER BY cdate;

I started this lab by running the above SQL:


dan@a78:~/ml4/public/class03demos $ psql.bash -f mvgavg5day.sql
CREATE TABLE prices5 AS 
SELECT
cdate
,closep
,AVG(closep)OVER(ORDER BY cdate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) mvgavg5day
FROM prices
WHERE cdate > '2016-08-01'
ORDER BY cdate;
SELECT 220
SELECT * FROM prices5 WHERE cdate BETWEEN '2016-08-01' AND '2016-08-31' ORDER BY cdate;
   cdate    |   closep   |      mvgavg5day      
------------+------------+----------------------
 2016-08-02 | 160.580002 | 160.5800020000000000
 2016-08-03 | 160.669998 | 160.6250000000000000
 2016-08-04 | 161.550003 | 160.9333343333333333
 2016-08-05 | 163.500000 | 161.5750007500000000
 2016-08-08 | 162.039993 | 161.6679992000000000
 2016-08-09 | 161.770004 | 161.9059996000000000
 2016-08-10 | 162.080002 | 162.1880004000000000
 2016-08-11 | 163.529999 | 162.5839996000000000
 2016-08-12 | 161.949997 | 162.2739990000000000
 2016-08-15 | 161.880005 | 162.2420014000000000
 2016-08-16 | 160.699997 | 162.0280000000000000
 2016-08-17 | 160.440002 | 161.7000000000000000
 2016-08-18 | 161.360001 | 161.2660004000000000
 2016-08-19 | 160.039993 | 160.8839996000000000
 2016-08-22 | 160.000000 | 160.5079986000000000
 2016-08-23 | 160.259995 | 160.4199982000000000
 2016-08-24 | 159.050003 | 160.1419984000000000
 2016-08-25 | 158.630005 | 159.5959992000000000
 2016-08-26 | 158.320007 | 159.2520020000000000
 2016-08-29 | 159.720001 | 159.1960022000000000
 2016-08-30 | 159.399994 | 159.0240020000000000
 2016-08-31 | 158.880005 | 158.9900024000000000
(22 rows)

dan@a78:~/ml4/public/class03demos $ 
dan@a78:~/ml4/public/class03demos $ 
dan@a78:~/ml4/public/class03demos $ 

Next, I wrote the script listed below:


"""
class03pd42.py

This script should mimic the SQL listed below:

-- Window Functions (5 Day Moving Average)
CREATE TABLE prices5 AS 
SELECT
cdate
,closep
,AVG(closep)OVER(ORDER BY cdate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) mvgavg5day
FROM prices
WHERE cdate > '2016-08-01'
ORDER BY cdate;

SELECT * FROM prices5 WHERE cdate BETWEEN '2016-08-01' AND '2016-08-31' ORDER BY cdate;
"""

# SQL lacks the ability to get data from the web.
# Pandas can get data from the web:

import pandas as pd

prices_df         = pd.read_csv('http://ml4.us/csv/ibm.csv')
prices_df.columns = ['cdate_s', 'openp', 'highp', 'lowp', 'closep', 'adjp', 'volume']

prices5_df = prices_df[['cdate_s','closep']].sort_values(by=['cdate_s'])
myr        = prices5_df.rolling(window=5)
prices5_df['mvgavg5day'] = myr.mean().closep
pred_sr    = (prices5_df.cdate_s > '2016-08-01')
prices5_df =  prices5_df[pred_sr]
pred2_sr   = (prices5_df.cdate_s >= '2016-08-01') & (prices5_df.cdate_s <= '2016-08-31')
print(prices5_df[pred2_sr])
'bye'

I saw this:


dan@a78:~/ml4/public/class03demos $ python class03pd42.py 
          cdate_s      closep  mvgavg5day
13740  2016-08-02  160.580002  161.169998
13741  2016-08-03  160.669998  160.937997
13742  2016-08-04  161.550003  160.973999
13743  2016-08-05  163.500000  161.550000
13744  2016-08-08  162.039993  161.667999
13745  2016-08-09  161.770004  161.906000
13746  2016-08-10  162.080002  162.188000
13747  2016-08-11  163.529999  162.584000
13748  2016-08-12  161.949997  162.273999
13749  2016-08-15  161.880005  162.242001
13750  2016-08-16  160.699997  162.028000
13751  2016-08-17  160.440002  161.700000
13752  2016-08-18  161.360001  161.266000
13753  2016-08-19  160.039993  160.884000
13754  2016-08-22  160.000000  160.507999
13755  2016-08-23  160.259995  160.419998
13756  2016-08-24  159.050003  160.141998
13757  2016-08-25  158.630005  159.595999
13758  2016-08-26  158.320007  159.252002
13759  2016-08-29  159.720001  159.196002
13760  2016-08-30  159.399994  159.024002
13761  2016-08-31  158.880005  158.990002
dan@a78:~/ml4/public/class03demos $ 
dan@a78:~/ml4/public/class03demos $

Notice that the Python output matches the SQL output after the first four rows.

Pandas calculates the first 4 moving-averages differently than how SQL does it.

Class03 Lab


ml4.us About Blog Contact Class01 Class02 Class03 Class04 Class05 Class06 Class07 Class08 Class09 Class10 dan101 Forum Google Hangout Vboxen