### Large Data with Scikit-learn

* Alexis Perrier - [@alexip](https://twitter.com/alexip)
* Data & Software - [@BerkleeOnline](https://twitter.com/berkleeonline) - Day
* Data Science contributor - [@ODSC](https://twitter.com/odsc) - Night
### Plan

1) What is large data? Out-of-core, streaming, online, batch?
2) Algorithms
3) Implementation
4) Examples
### Many great alternatives

* Dato: [GraphLab Create](https://dato.com/products/create/)
* [H2O.ai](https://www.h2o.ai/)
* Memory-optimized [AWS instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html)
* [pySpark MLlib](https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html)
* Text: [Gensim](https://radimrehurek.com/gensim/tutorial.html) [py]
* R: [SGD streaming package](https://github.com/airoldilab/sgd)
### Terminology

* **Out-of-core / external memory**: the data does not fit in main memory => access data from a slower store (disk, ...), typically 10x-100x slower
* **Offline**: all the data is available from the start
* **Online**: serialized; same results as **offline**; acts on new data right away
* **Streaming**: serialized; a limited number of passes over the data; processing can be postponed; "old" data matters less
* **Minibatch**: serialized in blocks
* **Incremental** = online + minibatch

https://stats.stackexchange.com/questions/22439/incremental-or-online-or-single-pass-or-data-stream-clustering-refers-to-the-sam
clf = Some Model or Transformation

- Train on the training data X_train, y_train:
  > clf.**fit**(X_train, y_train)
- Predict on the test data X_test:
  > y_pred = clf.**predict**(X_test)
- Assess performance on the test set by comparing predictions to the ground truth:
  > clf.**score**(X_test, y_test), or a metric function applied to (y_test, y_pred)
- Predict on new data: \\( \hat{y} = clf.predict(X_{new}) \\)
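A minimal sketch of this fit / predict / score pattern; `LogisticRegression` and the synthetic dataset are illustrative stand-ins, not part of the talk:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000)
clf.fit(X_train, y_train)            # train on the training set
y_pred = clf.predict(X_test)         # predict on held-out data
print(clf.score(X_test, y_test))     # mean accuracy on the test set
```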
### Scikit-learn: out-of-core

* Split the training data into blocks: mini-batches (sequential, random, ...)
* Load each block in sequence
* Train the model adaptively on each block
* Convergence!
### Scikit-learn minibatch learning

* Batch size **n**
* Split the data into **N** blocks
* Use **.partial_fit** instead of **.fit**:
  > clf.**partial_fit**(X_train(i), y(i), all_classes)
* All possible classes are given to partial_fit on the first call
* i = 1..N is the block index
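A minimal sketch of the partial_fit loop; the random blocks stand in for data loaded from disk, and the block count is illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier()                      # any estimator exposing partial_fit
all_classes = np.array([0, 1])             # every class must be declared on the first call

N = 100                                    # number of blocks (illustrative)
for i in range(N):
    # stand-in for loading block i from disk
    X_i = rng.normal(size=(500, 10))
    y_i = rng.integers(0, 2, size=500)
    clf.partial_fit(X_i, y_i, classes=all_classes)
```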
### Scikit: Implementation

Implementation with generators - see this [better example](https://adventuresindatascience.wordpress.com/2014/12/30/minibatch-learning-for-large-scale-data-using-scikit-learn/) of generator code.
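A sketch of the generator approach, assuming a hypothetical `big_dataset.csv` file with a `target` label column; only one chunk sits in memory at a time:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

def iter_minibatches(path, chunksize=10_000):
    """Yield (X, y) blocks from a csv that does not fit in memory."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        y = chunk.pop("target").values     # assumes a 'target' label column
        yield chunk.values, y

clf = SGDClassifier()
for X_block, y_block in iter_minibatches("big_dataset.csv"):
    clf.partial_fit(X_block, y_block, classes=[0, 1])
```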
# II Algorithms
#### Other loss functions

* GD: loss = MSE
* Passive Aggressive: hinge loss
* Perceptron: loss = +/- 1
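A hedged mapping to the corresponding scikit-learn estimators; all three expose partial_fit and so plug into the mini-batch loop above:

```python
from sklearn.linear_model import SGDRegressor, PassiveAggressiveClassifier, Perceptron

gd  = SGDRegressor()                  # default loss is squared error (MSE)
pa  = PassiveAggressiveClassifier()   # hinge-type loss, passive-aggressive updates
per = Perceptron()                    # perceptron rule: +/- 1, update only on mistakes
```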
###### Mini Batch K-Means

D. Sculley @Google, 2010, [Web Scale K-Means Clustering](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf)

* For each mini-batch:
  * For each sample \\( x \\) in the batch:
    * Find the closest center \\( C_{x} \\) (min distance)
  * For each sample \\( x \\) in the batch:
    * Update \\( C_{x} \\) with:
$$ C_x \leftarrow (1 - \eta)\, C_x + \eta\, x, \qquad \eta = \frac{1}{|C_x|} $$

where \\( |C_x| \\) is the number of samples assigned to center \\( C_x \\) so far (a per-center learning rate).
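A hedged sketch of the same algorithm through scikit-learn's MiniBatchKMeans, here fed random blocks as stand-ins for a real stream:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
km = MiniBatchKMeans(n_clusters=8, batch_size=100, random_state=0)

for _ in range(100):                    # stream of mini-batches
    X_batch = rng.random((100, 5))      # stand-in for a real block of data
    km.partial_fit(X_batch)             # assign to closest centers, then update them

print(km.cluster_centers_.shape)        # (8, 5)
```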
* Classification: mini-batches need to be balanced across classes
* Perceptron, PA and SGD put different emphasis on samples over time
* How does batch size influence the results?
### Batch size

The bigger the better? Until you run out of memory?

* Trade-off: **time efficiency** of training (favours small batches) vs **noisiness of the gradient estimate** (reduced by large batches)
* A bit of noise (small batches) can help escape local minima

In practice:

* small to moderate mini-batch sizes (10-500)
* a decaying learning rate

https://www.quora.com/Intuitively-how-does-mini-batch-size-affect-the-performance-of-gradient-descent
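A purely illustrative sketch of comparing batch sizes with a decaying (inverse-scaling) learning rate, reusing the hypothetical `iter_minibatches` generator from the implementation slide:

```python
from sklearn.linear_model import SGDClassifier

for batch_size in (10, 100, 500):
    # learning_rate="invscaling": eta = eta0 / t**power_t, i.e. the step size decays over updates
    clf = SGDClassifier(learning_rate="invscaling", eta0=0.1)
    for X_block, y_block in iter_minibatches("big_dataset.csv", chunksize=batch_size):
        clf.partial_fit(X_block, y_block, classes=[0, 1])
    # evaluate clf on a held-out set here to compare batch sizes
```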
# III Examples
### Part II: Data Munging (next time)

* Loading csv files
* Parsing, segmenting
* Word counts, TF-IDF, hashing vectorizer => time consuming
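For the text pipeline, a stateless HashingVectorizer is the usual out-of-core choice since it needs no vocabulary-building pass; a hedged sketch with a hypothetical `text_blocks()` iterator yielding (documents, labels):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)    # stateless: no fit, fixed output width
clf = SGDClassifier()

for docs, labels in text_blocks():                  # hypothetical block iterator
    X_block = vectorizer.transform(docs)            # sparse matrix, one block at a time
    clf.partial_fit(X_block, labels, classes=[0, 1])
```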
### Dask

* Continuum IO (the same people who brought you Conda)
* Easy install with pip or conda (no JVM, unlike Spark)
* Blocked algorithms
* Copies the numpy interface
* Very few code changes
* [Matthew Rocklin - PyData 2015](https://www.youtube.com/watch?v=ieW3G7ZzRZ0)
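A minimal dask.array sketch of the "copies the numpy interface" point (array sizes and chunking are arbitrary):

```python
import dask.array as da

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))  # 10 blocks of rows
m = x.mean(axis=0)       # builds a task graph; nothing is computed yet
print(m.compute()[:5])   # blocked computation runs here, block by block
```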
# Thank you

@alexip
[email protected]
https://alexisperrier.com
##### Links

* https://scikit-learn.org/stable/modules/scaling_strategies.html
* https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py
* Memory footprint of Python dicts: https://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries/
* Dask: https://github.com/dask/dask-examples/blob/master/nyctaxi-2013.ipynb
* Scikit-learn: https://fa.bianp.net/talks/sheffield_april_2014/#/step-19
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
* https://github.com/scikit-learn/scikit-learn/issues/4834
* https://github.com/szilard/benchm-ml/blob/master/0-init/1-install.txt
* https://cs229.stanford.edu/materials.html
* https://stackoverflow.com/questions/15036630/batch-gradient-descent-with-scikit-learn-sklearn
* https://sebastianruder.com/optimizing-gradient-descent/
* [Passive Aggressive](https://www.youtube.com/watch?v=TJU8NfDdqNQ)