# Class03

In Class03 we review topics and techniques from Class01 and Class02. Then, look at the topics listed below and work towards transforming understanding into projects:

## Topics:

• Linear Algebra
• NumPy
• Pandas
• SQL
• Linear Regression

## Questions:

• What is Linear Algebra?
• Why learn NumPy?
• How to learn Pandas?
• Is SQL easy to use?
• How to visualize Linear Regression?

## Lab (Pandas, Plotting, NumPy, Linear Algebra):

• Use Pandas to plot prices of GSPC for 2016.
• Plot a straight line from first price to last price.
• Calculate RMSE for that line.
• Find Linear Algebra expression to fit a line to a scatter plot.

# Simple Linear Algebra + NumPy

This discussion contains 3 languages: English, Math, and NumPy. Sometimes the Math looks similar to the NumPy.

```
Linear Algebra teaches us about three types of objects:
Scalar, Vector, Matrix.

A Scalar is simple; it is just a number.

For example, here is a Scalar: 3.2

A Vector is a collection of scalars.

A vector can be a row vector or a column vector.

If I don't know what kind of vector I have, I assume my vector is a column vector.

Here is a representation of a row vector made from three scalars:

[ 1.1 1.2 -4.8 ]

A Matrix is a collection of vectors.

A matrix can be built from row vectors or column vectors.

Also a matrix can be built directly from scalars.

Here is a representation of a matrix created from two row vectors:

[
1.1 1.2 -4.8
1.1 1.2 -4.8
]

I can use NumPy arrays to represent both vectors and matrices:

```
import numpy as np

vector_a = np.array([1.1, 1.2, -4.8 ])
matrix_a = np.array([vector_a,vector_a])

```

I can use NumPy to simulate vector addition:
```
sum_a = vector_a + vector_a

```

I can use NumPy to simulate matrix addition:
```
msum_a = matrix_a + matrix_a

```

I can use NumPy to simulate vector product which should give me a scalar:
```
vprod_a = np.matmul(vector_a, vector_a)

```

I can use NumPy to simulate vector, matrix product which should give me an error:
```
vmprod_a = np.matmul(vector_a, matrix_a)

```

The above NumPy statement will fail, which is consistent with the rules of Linear Algebra.

The rules say that the vector_a on the left needs 2 columns
because matrix_a has 2 rows.
But vector_a is a column vector so it has only 1 column.

I can use the NumPy shape attribute to see the number of rows and columns:
```
vshape = vector_a.shape
mshape = matrix_a.shape
# Notice that .shape is an attribute, not a method.

```

The above expressions tell me that vector_a has 3 rows.

When I created vector_a I did not specify what kind of vector I wanted.

So, NumPy gave me a column vector.

Also I see that matrix_a has 2 rows and 3 columns.

So I see that the number of columns in matrix_a match the row count in vector_a.

So, this expression should succeed:
```
mvprod_a = np.matmul(matrix_a, vector_a)
# mvprod_a and matrix_a should have same number of rows.

```

Also, matrix, vector multiplication can be visualized:

I use this English phrase to help me remember multiplication:

"I should match each ROW on left with a COLUMN on right."

The above phrase is easy to remember; this phrase describes the operation:

- Rotate each row on the left to match the column on the right.
- I will see that columns on left should match rows on right.
- Multiply each member.
- Copy sum into new row of a column vector.

So, if I have a vector on the right, it needs to be a column vector.
```

# A Straight Line

```
I start this discussion with some NumPy syntax:
```
import numpy as np

w0 = 1.01 # Im a scalar
w1 = 0.51 # Im a scalar

w_a = np.array([w0, w1]) # Im a column vector

x0 = 1 # Im an integer scalar
x1 = 0.7 # Im a scalar

x_a = np.array([x0, x1]).reshape(1,2) # Im a row vector

```

I represent Linear Algebra straight line expression:
```
y_a = np.matmul(x_a, w_a)

```

I represent Scalar Algebra straight line expression:
```
m = w1
b = w0

y_f = m*x1 + b

```

I should now see that the mathematical scalar expression:

y = mx + b

can be written as

y = w0 + w1 x

which can be written as

y = 1 w0 + x w1

which can be written as a Linear Algebra expression:

y = Xw

where X is a one row matrix which looks like this:

[ 1 x1 ]

and w is a column vector which looks like this:

[
w0
w1
]

Question:

The scalar expression for a straight line is this:

y = mx + b

and the linear algebra expression for a straight line is this:

y = Xw

why does b disappear?

The term b gets "hidden" inside of the vector w.
In fact b is the 0th member of w.

```

# A Scatter Plot + sklearn

```I start this discussion with some syntax:
```
x0 = 0.0
y0 = 1.0

x1 = 1.0
y1 = 1.5

```

With google, I studied the above question and I offer more syntax:
```
import matplotlib.pyplot as plt
plt.scatter([x0, x1], [y0, y1])
plt.show()

```

I ran the above code and saw this:

The scatter plot has only two points so I know it is easy to fit a line to the points.

I know the line has the form: y = mx + b

I can see from the plot that when x is 0, y is 1, so b is 1.

And I know that m is (y1 - y0) / (x1 - x0) is 0.5.

I can use matplotlib to display a line through the two points:

```
import matplotlib.pyplot as plt
x0 = 0.0
y0 = 1.0
x1 = 1.0
y1 = 1.5

plt.scatter([x0, x1], [y0, y1])
plt.plot(   [x0, x1], [y0, y1])
plt.show()

```

I ran the above code and saw this:

Next I add a third point which is near (x1, y1):
```
x2 = 1.10
y2 = 1.48

```

I plot:
```
import matplotlib.pyplot as plt
plt.scatter([x0, x1, x2], [y0, y1, y2])
plt.show()

```

I saw this:

I want to fit a line to the three points.

The third point should act like a magnet.

It should attract the straight line away from the second point.

So, that logic tells me that m will be less than 0.5 and near 0.5.
And b will probably be greater than 1 but near 1.

I can use syntax we learned in class01 to find m and b:
```
from sklearn import linear_model
linr_model = linear_model.LinearRegression()
x_l = [[x0],[x1],[x2]]
y_l = [y0,   y1,  y2 ]
linr_model.fit(x_l, y_l)
m = linr_model.coef_
b = linr_model.intercept_

```

When I ran the above syntax on my laptop, sklearn gave me these values for m and b:

m == array([0.46216216])
b == 1.0031531531531532

I wrote syntax to show the line:

```
from sklearn import linear_model
import matplotlib.pyplot as plt
x0 = 0.0
y0 = 1.0
x1 = 1.0
y1 = 1.5
x2 = 1.10
y2 = 1.48
plt.scatter([x0, x1, x2], [y0, y1, y2])

m = 0.462
b = 1.003

yhat_l = [ m*x_f + b for x_f in[x0, x1, x2] ]
plt.plot([x0, x1, x2], yhat_l)
plt.show()
```

I saw this:

```

# Use Linear Algebra to fit a line to a Scatter Plot

```
We can use Linear Algebra to fit a line to a Scatter Plot.

In Linear Algebra, I describe a line with this simple expression:

y = Xw

Since we have the scatter plot, we know what both y and X are.

In the Scalar world, I find w with Scalar Algebra:

w = y / X

In the world of Linear Algebra, finding w is difficult.

One reason for this is that division of a vector by a matrix is not defined.

Now, I claim this to be a Linear Algebra solution for w:

w =

X' is called: "X Transpose".

What is "X Transpose"?

I explain:

- If X is a row vector then X' is a column vector made from the elements of X.

- Or just visualize X' to be X rotated 90 degrees.

- If X is a matrix, then X' is made from row vectors rotated into column vectors.
- The first row becomes the first column.
- The second row becomes the second column...

X raised to the -1 is called: "Inverse of X".

What is "Inverse of X"?

A better question:

How to calculate "Inverse of X"?

Calculating the Inverse of X is not always easy,
but NumPy offers a method called pinv() which gives us a good approximation.

Now, I can offer some NumPy syntax to fit a line to a scatter plot of three points:
```
import numpy  as np
import pandas as pd
x0 = 0.0;  y0 = 1.0
x1 = 1.0;  y1 = 1.5
x2 = 1.10; y2 = 1.48

# It is easy to build a NumPy array from rows.
# But,
# To build a NumPy array from columns,
# I first create a Pandas DF from columns,
# then convert the DF to array:

col0_l = [1, 1, 1]
col1_l = [x0, x1, x2]
my_df = pd.DataFrame({'col0': col0_l, 'col1': col1_l})
x_a   = np.array(my_df)
y_a   = np.array([y0, y1, y2]).reshape(3,1)

lhs_a = np.linalg.pinv(np.matmul(x_a.T, x_a))
rhs_a = np.matmul(x_a.T, y_a)
w_a   = np.matmul(lhs_a, rhs_a)

```

After I ran the above syntax on my laptop, NumPy gave me this value for w_a:

>>> w_a
array([[1.00315315],
[0.46216216]])
>>>

When I compare w_a to linr_model from sklearn,
I see that:

w_a[0] == linr_model.intercept_
w_a[1] == linr_model.coef_

And, I am satisfied that Linear Algebra can fit a line to a scatter plot as well as sklearn.

```

• Use Linear Algebra to fit a line to GSPC 2016 price points.
• Calculate RMSE for that fitted line.
• Show a plot of the fitted line on top of the 2016 prices.

## Lab (COPY CSV into Postgres Table: prices):

• Describe what this syntax does:
``````
#!/bin/bash

# wgetibm.bash

cd /tmp/
rm -f ibm.csv
/usr/bin/wget https://ml4.herokuapp.com/csv/ibm.csv

/usr/bin/psql -aP pager=no<<EOF
DROP TABLE IF EXISTS prices;

CREATE TABLE prices(
cdate   date
,openp  numeric
,highp  numeric
,lowp   numeric
,closep numeric
,volume numeric
)
;

COPY prices(
cdate
,openp
,highp
,lowp
,closep
,volume
) FROM '/tmp/ibm.csv' WITH csv header
;
EOF

exit
``````
• I ran the above script and saw this:
``````
dan@h79:~/ml4/public/class03demos \$ ./wgetibm.bash
--2017-06-16 19:25:40--  https://ml4.herokuapp.com/csv/ibm.csv
Resolving ml4.herokuapp.com (ml4.herokuapp.com)... 184.168.221.28
Connecting to ml4.herokuapp.com (ml4.herokuapp.com)|184.168.221.28|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /csv/ibm.csv [following]
--2017-06-16 19:25:40--  https://ml4.herokuapp.com/csv/ibm.csv
Connecting to ml4.herokuapp.com (ml4.herokuapp.com)|184.168.221.28|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ml4.herokuapp.com/csv/ibm.csv [following]
--2017-06-16 19:25:40--  https://ml4.herokuapp.com/csv/ibm.csv
Resolving ml4.herokuapp.com (ml4.herokuapp.com)... 54.243.158.73
Connecting to ml4.herokuapp.com (ml4.herokuapp.com)|54.243.158.73|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957716 (935K) [text/csv]
Saving to: ‘ibm.csv’

ibm.csv             100%[===================>] 935.27K  1.83MB/s    in 0.5s

2017-06-16 19:25:41 (1.83 MB/s) - ‘ibm.csv’ saved [957716/957716]

DROP TABLE IF EXISTS prices;
DROP TABLE
CREATE TABLE prices(
cdate   date
,openp  numeric
,highp  numeric
,lowp   numeric
,closep numeric
,volume numeric
)
;
CREATE TABLE
COPY prices(
cdate
,openp
,highp
,lowp
,closep
,volume
) FROM '/tmp/ibm.csv' WITH csv header
;
COPY 13960
dan@h79:~/ml4/public/class03demos \$
dan@h79:~/ml4/public/class03demos \$
``````
• You should try to get the above script running in your Ubuntu16 host.

## Lab (SQL vs Pandas):

• Describe what this syntax does:
``````SELECT COUNT(cdate) FROM prices;
SELECT COUNT(lowp)  FROM prices;
SELECT COUNT(*)     FROM prices;
SELECT MIN(cdate),MIN(closep),MIN(volume) FROM prices;
SELECT MAX(closep),MAX(volume), AVG(closep),AVG(volume) FROM prices;``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````SELECT * FROM prices WHERE cdate = (SELECT MAX(cdate) FROM prices);
SELECT cdate,closep FROM prices WHERE cdate = (SELECT MAX(cdate)-1 FROM prices);
SELECT cdate,closep FROM prices WHERE cdate > (SELECT MAX(cdate)-10 FROM prices);``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````
SELECT max(closep)     FROM
(SELECT cdate,closep FROM prices WHERE cdate > (SELECT MAX(cdate)-10 FROM prices)) iv;

DROP   TABLE IF EXISTS prices2;
CREATE TABLE prices2 AS SELECT cdate,openp,closep FROM prices WHERE cdate > '2016-08-08';

DROP   TABLE IF EXISTS prices3;
CREATE TABLE prices3 AS SELECT cdate,openp,closep, closep - openp AS diff FROM prices2;
``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````
INSERT INTO prices3
SELECT cdate,openp,closep, closep - openp AS diff FROM prices
WHERE cdate = '2016-08-01';

INSERT INTO prices3
SELECT cdate,openp,closep, closep - openp AS diff FROM prices
WHERE  cdate BETWEEN '2016-08-01' AND '2016-08-31'
AND    cdate NOT IN (SELECT cdate FROM prices3);

UPDATE prices3 SET diff = 0;
``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````
UPDATE prices3 SET diff = 1 WHERE cdate = '2016-08-01';
UPDATE prices3 SET diff = closep - openp;
``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````
DELETE FROM prices3 WHERE cdate = '2016-08-10';

DELETE FROM prices3 WHERE cdate < '2016-08-10';

DELETE FROM prices3;
``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````
CREATE TABLE prices4 AS
SELECT
extract(year from cdate) yr
,cdate
,closep
FROM prices
WHERE extract(year from cdate) > 1999;

SELECT yr, COUNT(yr) FROM prices4 GROUP BY yr;

SELECT yr, COUNT(yr) FROM prices4 GROUP BY yr ORDER BY yr;
SELECT yr, COUNT(yr) FROM prices4 GROUP BY yr ORDER BY COUNT(yr);
``````
• Write Pandas syntax which mimics the above SQL.
• Describe what this syntax does:
``````
CREATE TABLE prices5 AS
SELECT
cdate
,closep
,AVG(closep)OVER(ORDER BY cdate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) mvgavg5day
FROM prices
WHERE cdate > '2016-08-01'
ORDER BY cdate;

SELECT * FROM prices5 WHERE cdate BETWEEN '2016-08-01' AND '2016-08-31' ORDER BY cdate;
``````
• Write Pandas syntax which mimics the above SQL.

## Lab (NumPy vs Pandas):

• Create a NumPy Array From Pandas DataFrame of GSPC prices.
• Describe what this syntax does:
``````
"""
class03pd44.py

This script should sort df by cdate
"""

import pandas as pd

prices_df = prices_df.sort_values(by=['cdate_s'])

'bye'

``````
• Write NumPy syntax which also sorts by cdate_s.
• Describe what this syntax does:
``````
"""
class03pd45.py

This script should demonstrate flexibility to locate iloc[]
when I want to slice and dice.

Demo:
python class03pd45.py
"""

import pandas as pd
import numpy  as np

# I should get 2 columns and rows 5 through 11:
s1_df = prices_df[['cdate_s','closep']].iloc[5:11]
print(s1_df)

# I do same as above but in different order.
# I should get rows 5 through 11 and 2 columns:
s2_df = prices_df.iloc[5:11][['cdate_s','closep']]
print(s2_df)
# s1_df and s2_df should have same data.

'bye'
``````
• Write NumPy syntax which does something similar
• Describe what this syntax does:
``````
"""
class03pd47.py

This script should slice and dice.

Demo:
python class03pd47.py
"""

import pandas as pd
import numpy  as np

# I should get 2016 July and two columns:
pred_sr = (prices_df.cdate_s > '2016-07') & (prices_df.cdate_s < '2016-08')
s1_df   = prices_df.loc[pred_sr][['cdate_s','closep']]
print(s1_df)

'bye'
``````
• Write NumPy syntax which does something similar
• Describe what this syntax does:
``````
"""

This script should compute columns pctlag1 and pctlead from closep.

Demo:
"""

import pandas as pd

# I should get 2016 July and two columns:
pred_sr = (prices_df.cdate_s > '2016-07') & (prices_df.cdate_s < '2016-08')
s1_df   = prices_df.loc[pred_sr][['cdate_s','closep']]

pctlead = 100 * (s1_df.closep.shift(-1) - s1_df.closep) / s1_df.closep
pctlag1 = 100 * (s1_df.closep - s1_df.closep.shift(1)) / s1_df.closep.shift(1)

s1_df['pctlag1'] = pctlag1

# I should visualize:
print(s1_df)

'bye'
``````
• Write NumPy syntax which does something similar

## Lab (Python + Postgres):

• Test if your copy of Python has Psycopg 2 installed.
• Install Psycopg 2 if it is not installed.
• Write/Run a simple Psycopg 2 script
• Write/Run a script which combines:
• Python
• Pandas
• SQLAlchemy
• CSV data
• Postgres