Class04 Answer:

Run the first example listed at this URL:

http://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param

The code in the first example fills only about a page.

The author of the example, though, never explains how to run it.

I ran the example by copying the code into a file: /tmp/ml_examp1.scala

Then I added one small yet significant enhancement.

I enclosed all the code within curly braces.

The curly braces help spark-shell "see" expressions that span multiple lines; by default, the shell evaluates each complete line as soon as it reads it.

For example, this call:

lr.setMaxIter(10)
  .setRegParam(0.01)

will confuse spark-shell, because the shell evaluates lr.setMaxIter(10) as a complete expression and then chokes on the dangling .setRegParam(0.01) on the next line.
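
Wrapped in braces, the two lines parse as one expression, and the shell waits for the closing brace before evaluating anything:

{ // spark-shell defers evaluation until the closing brace
  lr.setMaxIter(10)
    .setRegParam(0.01)
}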

The resulting file looks like this:


/*
I should enclose all syntax within curly-braces if I want to help
spark-shell "see" expressions which span multiple lines:
*/
{ // Initial curly-brace for spark-shell.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")


// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }
} // Final curly-brace for spark-shell.
/*
I should enclose all syntax within curly-braces if I want to help
spark-shell "see" expressions which span multiple lines.
*/
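
An alternative to the curly-brace trick, when working interactively, is the Scala REPL's :paste mode, which likewise defers evaluation until the whole block has been entered (finish with ctrl-D):

scala> :paste
// Entering paste mode (ctrl-D to finish)

lr.setMaxIter(10)
  .setRegParam(0.01)

// Exiting paste mode, now interpreting.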

Next I ran the example with a simple shell command:

~/spark/bin/spark-shell -i ml_examp1.scala
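
As an aside: if a spark-shell session is already running, the REPL's :load command runs the same file from inside the shell:

scala> :load /tmp/ml_examp1.scala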

I saw this:


dan@h80:~/ml4/public/class04 $ spark-shell -i ml_examp1.scala
Spark context Web UI available at http://192.168.1.80:4041
Spark context available as 'sc' (master = local[*], app id = local-1515714865688).
Spark session available as 'spark'.
Loading ml_examp1.scala...
LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial. (default: auto)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. (undefined)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)
regParam: regularization parameter (>= 0) (default: 0.0)
standardization: whether to standardize the training features before fitting the model (default: true)
threshold: threshold in binary classification prediction, in range [0, 1] (default: 0.5)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold (undefined)
tol: the convergence tolerance for iterative algorithms (>= 0) (default: 1.0E-6)
upperBoundsOnCoefficients: The upper bounds on coefficients if fitting under bound constrained optimization. (undefined)
upperBoundsOnIntercepts: The upper bounds on intercepts if fitting under bound constrained optimization. (undefined)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)

Model 1 was fit using parameters: {
	logreg_06fbb1c153f1-aggregationDepth: 2,
	logreg_06fbb1c153f1-elasticNetParam: 0.0,
	logreg_06fbb1c153f1-family: auto,
	logreg_06fbb1c153f1-featuresCol: features,
	logreg_06fbb1c153f1-fitIntercept: true,
	logreg_06fbb1c153f1-labelCol: label,
	logreg_06fbb1c153f1-maxIter: 10,
	logreg_06fbb1c153f1-predictionCol: prediction,
	logreg_06fbb1c153f1-probabilityCol: probability,
	logreg_06fbb1c153f1-rawPredictionCol: rawPrediction,
	logreg_06fbb1c153f1-regParam: 0.01,
	logreg_06fbb1c153f1-standardization: true,
	logreg_06fbb1c153f1-threshold: 0.5,
	logreg_06fbb1c153f1-tol: 1.0E-6
}
Model 2 was fit using parameters: {
	logreg_06fbb1c153f1-aggregationDepth: 2,
	logreg_06fbb1c153f1-elasticNetParam: 0.0,
	logreg_06fbb1c153f1-family: auto,
	logreg_06fbb1c153f1-featuresCol: features,
	logreg_06fbb1c153f1-fitIntercept: true,
	logreg_06fbb1c153f1-labelCol: label,
	logreg_06fbb1c153f1-maxIter: 30,
	logreg_06fbb1c153f1-predictionCol: prediction,
	logreg_06fbb1c153f1-probabilityCol: myProbability,
	logreg_06fbb1c153f1-rawPredictionCol: rawPrediction,
	logreg_06fbb1c153f1-regParam: 0.1,
	logreg_06fbb1c153f1-standardization: true,
	logreg_06fbb1c153f1-threshold: 0.55,
	logreg_06fbb1c153f1-tol: 1.0E-6
}
([-1.0,1.5,1.3], 1.0) -> prob=[0.05707304171034022,0.9429269582896597], prediction=1.0
([3.0,2.0,-0.1], 0.0) -> prob=[0.9238522311704104,0.07614776882958961], prediction=0.0
([0.0,2.2,-1.5], 1.0) -> prob=[0.1097277611477944,0.8902722388522056], prediction=1.0

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

scala> :quit
dan@h80:~/ml4/public/class04 $ 

Near the end of that output I see the three predictions from the logistic regression model, one per test row.
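
Each of those lines pairs a test feature vector and its true label with the model's probability vector [P(label=0), P(label=1)] and the resulting prediction. Since model2 was fit with its threshold set to 0.55, the decision rule amounts to this small sketch (prob being the probability Vector from the loop above):

val predicted = if (prob(1) > 0.55) 1.0 else 0.0  // predict 1.0 only when P(label=1) exceeds the 0.55 threshold

In the first row, for example, P(label=1) is about 0.94, comfortably above 0.55, so the prediction is 1.0.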


