Reduce GPU costs with on demand instances and startup scripts

This post is about leveraging on demand capabilities of costly virtual instances on the Google Cloud Engine using startup scripts.

Deep Learning is expensive

Here’s the situation: You’re working on some large dataset, and you feel the irresistible urge to release the Deep Learning beast on your models with VMs armed to the teeth with GPUs.

Since your local Macbook steps into the twilight zone every time you launch Keras, you decide to spin up a dragster style, GPU-powered VM on the Google Platform, AWS or Azure. Once the VM is ready, which in truth may take several days if it’s your first encounter with the CUDA Toolkit, you ssh into the VM and start working on your data and your models.

After a few hours of work, you’re still working on your scripts, cleaning up the data, training models, evaluating, and on and on. Time passes on while the earth pursues its never ending spin in the interplanetary void. When you realize your brain has as much jitsu left as a greek yogurt, you decide to call it a day and give the whole thing a rest. And of course, you sometimes forget to stop the instance. Your cash reserves leak out cent after cent, dollar/euro/pound after dollar/euro/pound throughout the night.

All these hours do add up. And at the end of the months you realize GPUs are way more expensive than you ever imagined. But hey Deep learning is really fun. Can’t stop now. Just need a few more hours. After all nobody really understands how these neural network work, do they? And you really need to practice to be able to call yourself a Deep Learning expert. Please just a few more hours of GPUs. Just 30 minutes, … I swear, …. come on!

So what’s a data scientist to do?

One solution is to go back to random forests and SVMs and give up on the whole deep learning thing. After all, as Vladimir Vapnik says, Deep Learning is just brute force training with a whole lot of data.

The other solution is to make the most of the on demand promise in cloud computing.

On demand and serverless

The whole promise of cloud computing is that you can spin up and release resources as needed. Way back in the early 2000s that mostly meant being able to add servers on the fly to support your traffic exploding when BoingBoing or Gizmodo suddenly put your startup on their front page. But for machine learning the same on demand concept is relevant when extra high computing power is needed. When working with Deep Learning, most of the mundane work of data cleaning and shaping can probably be carried out on your local machine or a low level VM. The only time GPU enabled VMs are truly needed is to train the Deep Learning models.

Which means that a resource conservative workflow should look like this:

1) Local: ETL, extract and explore dataset, clean and format data for consumption by Keras / Tensorflow / ….
2) Storage: Upload the properly formatted data to storage (GCS, S3, SQL, BQ) to make accessible by the VM
3) Local:Create a GPU enabled VM
4) EC2, GCE: run your Keras / TF / Pytorch script to train your model
5) EC2, GCE: Store intermediate and final results to storage, so you can access it from local
6) Shutdown or delete the VM when the script has finished running
7) Local: retrieve the models, make some predictions, evaluations, selections

Here, Local can be replaced by a smaller, less powerful VM running on CPUs and not GPUs.

With this workflow, you only spend money on expensive cloud resources on steps 3) and 4), potentially saving you significant amounts of cash at the end of the day. If you can manage to shutdown or even delete the VM once the script has finished running then you won’t even run the risk of leaving it running all through the night! Brilliant!

So in order to limit our resources, we need to be able to

1) Create a VM in the fly
2) Run a script on the VM from your local
3) Terminate the VM when the script is done

And we should do that (create, run, terminate) every time we want to test a new version of the dataset, a new DL architecture or new parameters. We could even potentially run several trials in parallel.

Let’s go!

Create a VM on the fly from an existing disk

I assume here that you already have created a VM and installed everything needed to run your scripts. Things like the conda distribution, scikit-learn, keras, GPUs and so forth. See How to setup a VM for data science on GCP and Launch a GPU-backed Google Compute Engine instance for more details (also this one) to install the CUDA Toolkit and the cuDNN library.

For those in a hurry, here’s an example of the command line for creating a preemptible VM on Google Cloud Engine (Ubuntu 17.10 with 50gb disk space). Preemptible VMs are temporary but way way cheaper than non preemptible ones. Great for research work not so much for production APIs.

gcloud beta compute --project "<project_name>" instances create "<instance_name>" --zone "us-east1-c" \
--machine-type "n1-standard-1" --subnet "default" --no-restart-on-failure --maintenance-policy "TERMINATE" \
--preemptible --service-account "<your_service_account>"  --image "ubuntu-1710-artful-v20180126" \
--image-project "ubuntu-os-cloud" --boot-disk-size "50" --no-boot-disk-auto-delete \
--boot-disk-type "pd-standard" --boot-disk-device-name "<disk_name>"

In the above and quite longish command, the non obvious but important flags are

--no-boot-disk-auto-delete: The disk will not be deleted when the instance is deleted
--preemptible: makes the VM temporary and saves money
--service-account "<your_service_account>": A service account is used by your VM to interact with other Google Cloud Platform APIs. The default service account is identifiable with that email [PROJECT_NUMBER][email protected] where the [PROJECT_NUMBER] can be found on your project dashboard

So let’s assume that your have created your initial disk and deleted the associated VM. The disk is now free to be used to create a new VM.

The following command line creates a preemptible VM from that disk

gcloud compute instances create <instance name> --disk name=<disk name>,boot=yes --preemptible

Ok so we can create a VM on the fly based on that disk. Now we want to run a script on that VM. Let’s say a python script.

The simplest way is to use the SSH command with the --command flag

gcloud compute ssh <instance name> \
--command '{Absolute path to }/python {absolute path to}<the script>.py'

and low and behold, that command will display the output of the remote python script on your local terminal. Try for instance gcloud compute ssh <instance name> --command 'ls -al'.

A more sophisticated way is to automatically run a script (shell, python, R, whatever…) when the VM is created by using startup scripts

For instance, we could want to run the following shell script on creation

#! /bin/bash
sudo apt-get update

printf '%s %s Some log message \n' $(date +%Y-%m-%d) $(date +%H:%M:%S) >> '{absolute_path}/startup_script.log'

# add and activate the github keys
eval "$(ssh-agent -s)"
ssh-add  {path to github key}

# log script start
cd {path to application folder}

# git update application
git pull origin master
# run script
{path to }/python {absolute path to}/<the script>.py

That script activates the github keys, updates the application from github, and runs the script. Pretty neat. Creating the VM and making sure the VM runs that script is just an extension of the above VM creation command line But first you need to make the script available to the VM by uploading to Google Storage with gsutil

gsutil cp {local}/<shell script> gs://{bucket name}/

for more gsutil example check my post on https://alexisperrier.com/gcp/2018/01/01/google-storage-gsutil.html.

And now we can create the VM and have the script run on start. The following command line will attach the script to the VM as a startup script. Every time the VM is started the script will be executed.

gcloud compute instances create <instance name> --disk name=<disk name>,boot=yes --preemptible \
--scopes storage-ro \
--metadata startup-script-url=gs://{bucket name}/<shell script>  --preemptible

Now shutdown that VM!

So we are now able to

start a VM on demand from a disk
run a script on that VM from our local

We just need to find a way to terminate that VM once the script has finished running. This is where things get tricky mainly because it can be difficult to know with certainty when the script has finished running.

The simplest solution is to add the following shutdown line at the end of the startup script

shutdown -h now

This forces the VM to stop once the script is done. The VM still exists and is not deleted. In terms of pricing, this might just be enough as the VM is not priced when it’s idle. From the google doc: Instances that are in a TERMINATED state are not charged …. The associated disks, IPs, and other resources are still billed but not the VM.

It’s also possible to not only shutdown the VM but also delete it from within the VM. In other words having the VM commit seppuku. I mean specially if the script fails to run, the VM should take the blame for having failed its master (You) and rightly self suicide, makes sense. In a way maybe. The following code is from this thread. It can run from within the VM. It requires a bit of tuning on the service account scopes for it to have the proper permissions and actually work.

gcloud compute instances delete $myName --zone=$myZone --quiet

where the name of the VM comes from myName=$(hostname) and the zone from

# Get the zone
zoneMetadata=$(curl "https://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
myZone="${zoneMetadataSplit[3]}"

The only problem with that approach (stopping or deleting from the startup script) is that every time you start the VM, well, it will run the script and shutdown, and at the same time preventing you from ssh’ing into it to check the logs, make some modifications or inspect the results. Bummer!

The other solution would be to have your main model training script write a status update to a file or an external storage bucket or database, and from your local, regularly check the status of the script before deciding to the the shutdown command.

In the end, although a bit hacky, I think this is the best solution. Your whole workflow becomes now

Create the VM and set it up with all the right libraries and GPUs. Make sure the disk is persistent.
Delete the VM
Write the startup script that runs your model training python code. Make sure the python code exports its status (running, failed, done) to some external resources or some internal file as its being executed. Upload the startup script to a bucket on Google Storage
Create (1st time) / start (subsequent times) the VM and associate the startup script to it
Have a local conductor shutdown script regularly check the status of the training python execution on the VM and shutdown the VM when the status is failed or done

Note: Instead of using a startup script, you could also simply run the python model code using the gcloud compute ssh -- command <python script> command in conjunction with the local conductor shutdown script. But I feel the use of start up scripts basically dedicates the VM to that usage and that usage only. Similarly to writing good quality code, where a method or function should do only one thing at a time, my feeling is that a VM should be used only for one goal and one goal only. After all you can have as many VMs as you like as long as they are idle. Disks prices are low and usually not a problem.

Conclusion

The whole point behind using cloud resources is to leverage its at-will / on-demand capabilities to reduce costs. Doing so requires using startup scripts and some external monitoring to shutdown the VM once the task has been completed.

This is of course just one way of doing things.

Please let me know in the comments, how YOU manage costly on demand instances. And thanks for reading this post until the end.