### Large Data with Scikit-learn
* Alexis Perrier - [@alexip](https://twitter.com/alexip)
* Data & Software - [@BerkleeOnline](https://twitter.com/berkleeonline) - by day
* Data Science contributor - [@ODSC](https://twitter.com/odsc) - by night
### Plan
1) What is large data?
out-of-core, streaming, online, batch?
2) Algorithms
3) Implementation
4) Examples
### Many great alternatives
* Dato: [GraphLab create](https://dato.com/products/create/)
* [H20.ai](https://www.h2o.ai/)
* memory optimized [AWS instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html)
* [pySpark MLlib](https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html)
* Text: [Gensim](https://radimrehurek.com/gensim/tutorial.html) [py]
* R: [SGD streaming package](https://github.com/airoldilab/sgd)
### Terminology
* **Out-of-core / External memory**
Data does not fit in main memory => must access data stored in a slower data store (disk, ...), typically 10x-100x slower
* **Offline**: all the data is available from the start
* **Online**: serialized; same results as **offline**; acts on new data right away
* **Streaming**: serialized; limited number of passes over the data; can postpone processing; "old" data is of less importance
* **Minibatch**: serialized in blocks
* **Incremental** = online + minibatch
https://stats.stackexchange.com/questions/22439/incremental-or-online-or-single-pass-or-data-stream-clustering-refers-to-the-sam
clf = Some Model or Transformation
- Train on training data X_train, y_train
> clf.**fit**(X_train,y_train)
- Predict on test data: X_test => y_pred
> y_pred = clf.**predict**(X_test)
- Assess model performance on the test set: y_truth vs y_pred
> clf.**score**(X_test, y_truth), or a metric such as accuracy_score(y_truth, y_pred)
- Predict on new data: \\( \hat{y} = clf.predict(X_{new}) \\)
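A minimal sketch of this fit / predict / score API, using SGDClassifier and a synthetic dataset purely for illustration:

```python
# Sketch of the estimator API above; any scikit-learn model follows the same pattern.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier()                  # some model or transformation
clf.fit(X_train, y_train)              # train on the training data
y_pred = clf.predict(X_test)           # predict on the test data
print(clf.score(X_test, y_test))       # mean accuracy on the test set
print(accuracy_score(y_test, y_pred))  # same assessment via an explicit metric
```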
### Scikit-learn: out-of-core
* Split the training data into blocks: mini-batches (sequential, random, ...)
* Load each block in sequence
* Train the model adaptively on each block
* Convergence!
### Scikit-learn minibatch learning
* Batch size **n**
* split the data in **N** blocks
* **.partial_fit** instead of **.fit**
> clf.**partial_fit**(X_train_i, y_i, classes=all_classes)
* All possible classes must be given to partial_fit on the first call
* i = 1..N, the block index
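A minimal sketch of the loop from the two slides above, assuming SGDClassifier, a synthetic dataset, and a block size of 500 purely for illustration:

```python
# Split the training data into blocks and update the model block by block.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, random_state=0)
all_classes = np.unique(y)        # partial_fit needs every possible class on the first call
clf = SGDClassifier()

batch_size = 500
for i in range(0, len(X), batch_size):
    X_block, y_block = X[i:i + batch_size], y[i:i + batch_size]
    clf.partial_fit(X_block, y_block, classes=all_classes)
```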
### Scikit: Implementation
Implementation with generators
Generator code
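A sketch of the generator pattern, assuming a hypothetical `train.csv` with numeric feature columns plus a `target` column; pandas reads it in chunks so the full file never sits in memory:

```python
# Generator that yields minibatches from a CSV file, feeding partial_fit.
import pandas as pd
from sklearn.linear_model import SGDClassifier

def iter_minibatches(path, chunksize=1000):
    """Yield (X_block, y_block) minibatches read from a CSV file."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        y_block = chunk.pop('target').values   # remove the label column
        yield chunk.values, y_block            # remaining columns are the features

clf = SGDClassifier()
all_classes = [0, 1]                           # assumed binary problem
for X_block, y_block in iter_minibatches('train.csv'):
    clf.partial_fit(X_block, y_block, classes=all_classes)
```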
[better example](https://adventuresindatascience.wordpress.com/2014/12/30/minibatch-learning-for-large-scale-data-using-scikit-learn/)
#### Other loss functions
* GD: loss = MSE
* Passive Aggressive: Hinge loss
* Perceptron: loss = +/- 1
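For reference, each of these corresponds to a scikit-learn estimator that supports partial_fit; the concrete estimator mapping below is a sketch, not from the slides:

```python
from sklearn.linear_model import SGDRegressor, PassiveAggressiveClassifier, Perceptron

sgd = SGDRegressor()                 # gradient descent on a squared (MSE) loss by default
pa  = PassiveAggressiveClassifier()  # hinge loss
per = Perceptron()                   # perceptron loss
# All three expose partial_fit(), so they slot into the same minibatch loop as above
# (the regressor's partial_fit takes no classes argument).
```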
#### Mini-Batch K-Means
D. Sculley @Google, 2010, [Web Scale K-Means Clustering](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf)
* Each mini-batch
* Each sample \\( x \\) in the batch
* Find closest center \\( C_{x} \\) (minimum distance)
* Each sample \\( x \\) in the batch
* Update \\( C_{x} \\) with a per-center learning rate \\( \eta = \frac{1}{|C_x|} \\), where \\( |C_x| \\) is the number of samples assigned to \\( C_{x} \\) so far
$$ C_x \leftarrow (1 - \eta)\, C_x + \eta\, x $$
* Classification: mini-batches need to contain a balanced mix of the classes
* Perceptron, PA and SGD put different emphasis on samples over time
* How does batch size influence the result?
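scikit-learn ships this algorithm as MiniBatchKMeans, which also exposes partial_fit; a minimal sketch with random stand-in minibatches:

```python
# Train MiniBatchKMeans one block at a time; each call assigns samples to their
# closest centers and then updates those centers with a per-center learning rate.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0)

for _ in range(50):                 # stream of minibatches
    X_block = rng.randn(100, 2)     # stand-in for a block read from disk
    km.partial_fit(X_block)

print(km.cluster_centers_)
```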
### Batch size
The bigger the better, until you run out of memory?
* **time efficiency** of training (smaller batches mean cheaper, more frequent updates) VS **noisiness of the gradient** estimate (larger batches give a less noisy estimate)
Adding some noise (small batches) can be useful to escape local minima
In practice
* small to moderate mini-batch sizes (10-500)
* decaying learning rate
https://www.quora.com/Intuitively-how-does-mini-batch-size-affect-the-performance-of-gradient-descent
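As a sketch of the "decaying learning rate" advice, SGDClassifier exposes an inverse-scaling schedule; the parameter values here are arbitrary:

```python
# eta = eta0 / t^power_t : the learning rate decays as training progresses.
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(learning_rate='invscaling', eta0=0.1, power_t=0.25)
# then train with clf.partial_fit(...) on minibatches of, say, 10-500 samples
```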
### Part II: Data Munging (next time)
* Loading CSV files
* Parsing, segmenting
* Word counts, TF-IDF, hashing vectorizer
=> time consuming
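As a small preview, HashingVectorizer is the stateless piece that makes out-of-core text pipelines possible (no fit pass over the full corpus); a minimal sketch with placeholder text:

```python
# HashingVectorizer needs no fit, so it can vectorize text block by block.
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**18)
X_block = vec.transform(["some text from one minibatch",
                         "more text from the same minibatch"])
print(X_block.shape)   # (2, 262144) sparse matrix
```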
### Dask
* Continuum IO (same people who brought you Conda)
* Easy install pip or conda (no JVM like spark)
* Blocked algorithms
* Mirrors the NumPy interface
* Very few code changes
* [Matthew Rocklin - PyData 2015](https://www.youtube.com/watch?v=ieW3G7ZzRZ0)
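A minimal sketch of the blocked, NumPy-like interface (shapes and chunk sizes are arbitrary):

```python
# The array is split into chunks and the reduction runs block by block.
import dask.array as da

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
result = x.mean(axis=0).compute()   # triggers the blocked computation
print(result.shape)                 # (1000,)
```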
##### Links
* https://scikit-learn.org/stable/modules/scaling_strategies.html
* https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py
* Memory imprint of Python Dict https://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries/
* Dask https://github.com/dask/dask-examples/blob/master/nyctaxi-2013.ipynb
* scikit learn https://fa.bianp.net/talks/sheffield_april_2014/#/step-19
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
* https://github.com/scikit-learn/scikit-learn/issues/4834
* https://github.com/szilard/benchm-ml/blob/master/0-init/1-install.txt
* https://cs229.stanford.edu/materials.html
* https://stackoverflow.com/questions/15036630/batch-gradient-descent-with-scikit-learn-sklearn
* https://sebastianruder.com/optimizing-gradient-descent/
* [Passive Aggressive](https://www.youtube.com/watch?v=TJU8NfDdqNQ)