In Class03 we review topics and techniques from Class01 and Class02. Then, look at the topics listed below and work towards transforming understanding into projects:
This discussion contains 3 languages: English, Math, and NumPy. Sometimes the Math looks similar to the NumPy.
Linear Algebra teaches us about three types of objects: Scalar, Vector, Matrix. A Scalar is simple; it is just a number. For example, here is a Scalar: 3.2 A Vector is a collection of scalars. A vector can be a row vector or a column vector. If I don't know what kind of vector I have, I assume my vector is a column vector. Here is a representation of a row vector made from three scalars: [ 1.1 1.2 -4.8 ] A Matrix is a collection of vectors. A matrix can be built from row vectors or column vectors. Also a matrix can be built directly from scalars. Here is a representation of a matrix created from two row vectors: [ 1.1 1.2 -4.8 1.1 1.2 -4.8 ] I can use NumPy arrays to represent both vectors and matrices:I can use NumPy to simulate vector addition:import numpy as np vector_a = np.array([1.1, 1.2, -4.8 ]) matrix_a = np.array([vector_a,vector_a])
I can use NumPy to simulate matrix addition:sum_a = vector_a + vector_a
I can use NumPy to simulate vector product which should give me a scalar:msum_a = matrix_a + matrix_a
I can use NumPy to simulate vector, matrix product which should give me an error:vprod_a = np.matmul(vector_a, vector_a)
The above NumPy statement will fail, which is consistent with the rules of Linear Algebra. The rules say that the vector_a on the left needs 2 columns because matrix_a has 2 rows. But vector_a is a column vector so it has only 1 column. I can use the NumPy shape attribute to see the number of rows and columns:vmprod_a = np.matmul(vector_a, matrix_a)
The above expressions tell me that vector_a has 3 rows. When I created vector_a I did not specify what kind of vector I wanted. So, NumPy gave me a column vector. Also I see that matrix_a has 2 rows and 3 columns. So I see that the number of columns in matrix_a match the row count in vector_a. So, this expression should succeed:vshape = vector_a.shape mshape = matrix_a.shape # Notice that .shape is an attribute, not a method.
Also, matrix, vector multiplication can be visualized: https://www.google.com/search?tbm=isch&q=matrix+vector+multiplicationmvprod_a = np.matmul(matrix_a, vector_a) # mvprod_a and matrix_a should have same number of rows.
I use this English phrase to help me remember multiplication: "I should match each ROW on left with a COLUMN on right." The above phrase is easy to remember; this phrase describes the operation: - Rotate each row on the left to match the column on the right. - I will see that columns on left should match rows on right. - Multiply each member. - Sum all products. - Copy sum into new row of a column vector. So, if I have a vector on the right, it needs to be a column vector.
I start this discussion with some NumPy syntax:I represent Linear Algebra straight line expression:import numpy as np w0 = 1.01 # Im a scalar w1 = 0.51 # Im a scalar w_a = np.array([w0, w1]) # Im a column vector x0 = 1 # Im an integer scalar x1 = 0.7 # Im a scalar x_a = np.array([x0, x1]).reshape(1,2) # Im a row vector
I represent Scalar Algebra straight line expression:y_a = np.matmul(x_a, w_a)
I should now see that the mathematical scalar expression: y = mx + b can be written as y = w0 + w1 x which can be written as y = 1 w0 + x w1 which can be written as a Linear Algebra expression: y = Xw where X is a one row matrix which looks like this: [ 1 x1 ] and w is a column vector which looks like this: [ w0 w1 ] Question: The scalar expression for a straight line is this: y = mx + b and the linear algebra expression for a straight line is this: y = Xw why does b disappear? Answer: The term b gets "hidden" inside of the vector w. In fact b is the 0th member of w.m = w1 b = w0 y_f = m*x1 + b
I start this discussion with some syntax:Next, I propose we ask google a question: https://www.google.com/search?q=with+python+matplotlib+how+to+do+scatterplot With google, I studied the above question and I offer more syntax:x0 = 0.0 y0 = 1.0 x1 = 1.0 y1 = 1.5
I ran the above code and saw this: The scatter plot has only two points so I know it is easy to fit a line to the points. I know the line has the form: y = mx + b I can see from the plot that when x is 0, y is 1, so b is 1. And I know that m is (y1 - y0) / (x1 - x0) is 0.5. I can use matplotlib to display a line through the two points:import matplotlib.pyplot as plt plt.scatter([x0, x1], [y0, y1]) plt.show()
I ran the above code and saw this: Next I add a third point which is near (x1, y1):import matplotlib.pyplot as plt x0 = 0.0 y0 = 1.0 x1 = 1.0 y1 = 1.5 plt.scatter([x0, x1], [y0, y1]) plt.plot( [x0, x1], [y0, y1]) plt.show()
I plot:x2 = 1.10 y2 = 1.48
I saw this: I want to fit a line to the three points. The third point should act like a magnet. It should attract the straight line away from the second point. So, that logic tells me that m will be less than 0.5 and near 0.5. And b will probably be greater than 1 but near 1. I can use syntax we learned in class01 to find m and b:import matplotlib.pyplot as plt plt.scatter([x0, x1, x2], [y0, y1, y2]) plt.show()
When I ran the above syntax on my laptop, sklearn gave me these values for m and b: m == array([0.46216216]) b == 1.0031531531531532 I wrote syntax to show the line:from sklearn import linear_model linr_model = linear_model.LinearRegression() x_l = [[x0],[x1],[x2]] y_l = [y0, y1, y2 ] linr_model.fit(x_l, y_l) m = linr_model.coef_ b = linr_model.intercept_
I saw this:from sklearn import linear_model import matplotlib.pyplot as plt x0 = 0.0 y0 = 1.0 x1 = 1.0 y1 = 1.5 x2 = 1.10 y2 = 1.48 plt.scatter([x0, x1, x2], [y0, y1, y2]) m = 0.462 b = 1.003 yhat_l = [ m*x_f + b for x_f in[x0, x1, x2] ] plt.plot([x0, x1, x2], yhat_l) plt.show()
We can use Linear Algebra to fit a line to a Scatter Plot. In Linear Algebra, I describe a line with this simple expression: y = Xw Since we have the scatter plot, we know what both y and X are. In the Scalar world, I find w with Scalar Algebra: w = y / X In the world of Linear Algebra, finding w is difficult. One reason for this is that division of a vector by a matrix is not defined. I propose we ask google a question: https://www.google.com/search?q=with+Linear+Algebra+how+to+fit+a+line+to+scatter+plot I studied results from google. Now, I claim this to be a Linear Algebra solution for w: w =
X' is called: "X Transpose". What is "X Transpose"? I explain: - If X is a row vector then X' is a column vector made from the elements of X. - Or just visualize X' to be X rotated 90 degrees. - If X is a matrix, then X' is made from row vectors rotated into column vectors. - The first row becomes the first column. - The second row becomes the second column... X raised to the -1 is called: "Inverse of X". What is "Inverse of X"? I suggest you ask Google: http://www.google.com/search?tbm=isch&q=In+Linear+Algebra+what+is+inverse+of+matrix A better question: How to calculate "Inverse of X"? Calculating the Inverse of X is not always easy, but NumPy offers a method called pinv() which gives us a good approximation. Now, I can offer some NumPy syntax to fit a line to a scatter plot of three points:After I ran the above syntax on my laptop, NumPy gave me this value for w_a: >>> w_a array([[1.00315315], [0.46216216]]) >>> When I compare w_a to linr_model from sklearn, I see that: w_a[0] == linr_model.intercept_ w_a[1] == linr_model.coef_ And, I am satisfied that Linear Algebra can fit a line to a scatter plot as well as sklearn.import numpy as np import pandas as pd x0 = 0.0; y0 = 1.0 x1 = 1.0; y1 = 1.5 x2 = 1.10; y2 = 1.48 # It is easy to build a NumPy array from rows. # But, # To build a NumPy array from columns, # I first create a Pandas DF from columns, # then convert the DF to array: col0_l = [1, 1, 1] col1_l = [x0, x1, x2] my_df = pd.DataFrame({'col0': col0_l, 'col1': col1_l}) x_a = np.array(my_df) y_a = np.array([y0, y1, y2]).reshape(3,1) lhs_a = np.linalg.pinv(np.matmul(x_a.T, x_a)) rhs_a = np.matmul(x_a.T, y_a) w_a = np.matmul(lhs_a, rhs_a)
#!/bin/bash
# wgetibm.bash
# This script should download ibm.csv and then load it into a table called prices.
cd /tmp/
rm -f ibm.csv
/usr/bin/wget https://ml4.herokuapp.com/csv/ibm.csv
/usr/bin/psql -aP pager=no<<EOF
DROP TABLE IF EXISTS prices;
CREATE TABLE prices(
cdate date
,openp numeric
,highp numeric
,lowp numeric
,closep numeric
,adjp numeric
,volume numeric
)
;
COPY prices(
cdate
,openp
,highp
,lowp
,closep
,adjp
,volume
) FROM '/tmp/ibm.csv' WITH csv header
;
EOF
exit
dan@h79:~/ml4/public/class03demos $ ./wgetibm.bash
--2017-06-16 19:25:40-- https://ml4.herokuapp.com/csv/ibm.csv
Resolving ml4.herokuapp.com (ml4.herokuapp.com)... 184.168.221.28
Connecting to ml4.herokuapp.com (ml4.herokuapp.com)|184.168.221.28|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /csv/ibm.csv [following]
--2017-06-16 19:25:40-- https://ml4.herokuapp.com/csv/ibm.csv
Connecting to ml4.herokuapp.com (ml4.herokuapp.com)|184.168.221.28|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ml4.herokuapp.com/csv/ibm.csv [following]
--2017-06-16 19:25:40-- https://ml4.herokuapp.com/csv/ibm.csv
Resolving ml4.herokuapp.com (ml4.herokuapp.com)... 54.243.158.73
Connecting to ml4.herokuapp.com (ml4.herokuapp.com)|54.243.158.73|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957716 (935K) [text/csv]
Saving to: ‘ibm.csv’
ibm.csv 100%[===================>] 935.27K 1.83MB/s in 0.5s
2017-06-16 19:25:41 (1.83 MB/s) - ‘ibm.csv’ saved [957716/957716]
DROP TABLE IF EXISTS prices;
DROP TABLE
CREATE TABLE prices(
cdate date
,openp numeric
,highp numeric
,lowp numeric
,closep numeric
,adjp numeric
,volume numeric
)
;
CREATE TABLE
COPY prices(
cdate
,openp
,highp
,lowp
,closep
,adjp
,volume
) FROM '/tmp/ibm.csv' WITH csv header
;
COPY 13960
dan@h79:~/ml4/public/class03demos $
dan@h79:~/ml4/public/class03demos $
SELECT COUNT(cdate) FROM prices;
SELECT COUNT(lowp) FROM prices;
SELECT COUNT(*) FROM prices;
SELECT MIN(cdate),MIN(closep),MIN(volume) FROM prices;
SELECT MAX(closep),MAX(volume), AVG(closep),AVG(volume) FROM prices;
SELECT * FROM prices WHERE cdate = (SELECT MAX(cdate) FROM prices);
SELECT cdate,closep FROM prices WHERE cdate = (SELECT MAX(cdate)-1 FROM prices);
SELECT cdate,closep FROM prices WHERE cdate > (SELECT MAX(cdate)-10 FROM prices);
SELECT max(closep) FROM
(SELECT cdate,closep FROM prices WHERE cdate > (SELECT MAX(cdate)-10 FROM prices)) iv;
DROP TABLE IF EXISTS prices2;
CREATE TABLE prices2 AS SELECT cdate,openp,closep FROM prices WHERE cdate > '2016-08-08';
DROP TABLE IF EXISTS prices3;
CREATE TABLE prices3 AS SELECT cdate,openp,closep, closep - openp AS diff FROM prices2;
INSERT INTO prices3
SELECT cdate,openp,closep, closep - openp AS diff FROM prices
WHERE cdate = '2016-08-01';
INSERT INTO prices3
SELECT cdate,openp,closep, closep - openp AS diff FROM prices
WHERE cdate BETWEEN '2016-08-01' AND '2016-08-31'
AND cdate NOT IN (SELECT cdate FROM prices3);
UPDATE prices3 SET diff = 0;
UPDATE prices3 SET diff = 1 WHERE cdate = '2016-08-01';
UPDATE prices3 SET diff = closep - openp;
DELETE FROM prices3 WHERE cdate = '2016-08-10';
DELETE FROM prices3 WHERE cdate < '2016-08-10';
DELETE FROM prices3;
CREATE TABLE prices4 AS
SELECT
extract(year from cdate) yr
,cdate
,closep
FROM prices
WHERE extract(year from cdate) > 1999;
SELECT yr, COUNT(yr) FROM prices4 GROUP BY yr;
SELECT yr, COUNT(yr) FROM prices4 GROUP BY yr ORDER BY yr;
SELECT yr, COUNT(yr) FROM prices4 GROUP BY yr ORDER BY COUNT(yr);
CREATE TABLE prices5 AS
SELECT
cdate
,closep
,AVG(closep)OVER(ORDER BY cdate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) mvgavg5day
FROM prices
WHERE cdate > '2016-08-01'
ORDER BY cdate;
SELECT * FROM prices5 WHERE cdate BETWEEN '2016-08-01' AND '2016-08-31' ORDER BY cdate;
"""
class03pd44.py
This script should sort df by cdate
"""
import pandas as pd
prices_df = pd.read_csv('https://ml4.herokuapp.com/csv/GSPC.csv')
prices_df.columns = ['cdate_s','openp','highp','lowp','closep','adjp','volume']
prices_df = prices_df.sort_values(by=['cdate_s'])
print(prices_df.head())
'bye'
"""
class03pd45.py
This script should demonstrate flexibility to locate iloc[]
when I want to slice and dice.
Demo:
python class03pd45.py
"""
import pandas as pd
import numpy as np
prices_df = pd.read_csv('https://ml4.herokuapp.com/csv/GSPC.csv')
prices_df.columns = ['cdate_s','openp','highp','lowp','closep','adjp','volume']
# I should get 2 columns and rows 5 through 11:
s1_df = prices_df[['cdate_s','closep']].iloc[5:11]
print(s1_df)
# I do same as above but in different order.
# I should get rows 5 through 11 and 2 columns:
s2_df = prices_df.iloc[5:11][['cdate_s','closep']]
print(s2_df)
# s1_df and s2_df should have same data.
'bye'
"""
class03pd47.py
This script should slice and dice.
Demo:
python class03pd47.py
"""
import pandas as pd
import numpy as np
prices_df = pd.read_csv('https://ml4.herokuapp.com/csv/GSPC.csv')
prices_df.columns = ['cdate_s','openp','highp','lowp','closep','adjp','volume']
# I should get 2016 July and two columns:
pred_sr = (prices_df.cdate_s > '2016-07') & (prices_df.cdate_s < '2016-08')
s1_df = prices_df.loc[pred_sr][['cdate_s','closep']]
print(s1_df)
'bye'
"""
pct_laglead.py
This script should compute columns pctlag1 and pctlead from closep.
Demo:
python pct_laglead.py
"""
import pandas as pd
prices_df = pd.read_csv('https://ml4.herokuapp.com/csv/GSPC.csv')
prices_df.columns = ['cdate_s','openp','highp','lowp','closep','adjp','volume']
# I should get 2016 July and two columns:
pred_sr = (prices_df.cdate_s > '2016-07') & (prices_df.cdate_s < '2016-08')
s1_df = prices_df.loc[pred_sr][['cdate_s','closep']]
pctlead = 100 * (s1_df.closep.shift(-1) - s1_df.closep) / s1_df.closep
pctlag1 = 100 * (s1_df.closep - s1_df.closep.shift(1)) / s1_df.closep.shift(1)
s1_df['pctlag1'] = pctlag1
s1_df['pctlead'] = pctlead
# I should visualize:
print(s1_df)
'bye'