Google Storage is a file storage service available from Google Cloud. Quite similar to Amazon S3, it offers interesting functionalities such as signed URLs, bucket synchronization, collaborative bucket settings, parallel uploads, and S3 compatibility. Gsutil, the associated command line tool, is part of the gcloud command line interface.

After a brief presentation of the Google Cloud Storage service, I will list the most important and useful gsutil command lines and address a few of the service's particularities.

Google Storage

The Google Storage platform is Google's enterprise storage solution. Google Storage offers a classic bucket-based file structure, similar to AWS S3 and Azure Storage. Google Storage was introduced in May 2010 as Google Storage for Developers, a RESTful cloud service limited at the time to a few hundred developers. gsutil, the command line tool associated with Google Storage, was released at the same time.

Fast forward to 2018: Google Storage now offers three levels of storage with different accessibility and pricing.

  • Standard storage is for fast access to large amounts of data. It offers low-latency responses to requests.
  • DRA (Durable Reduced Availability) storage is for long-term storage of infrequently accessed data and is priced lower than Standard storage.
  • Nearline storage is for even less frequent access and offers longer response times. It is the cheapest option.

Google Storage's price structure depends on location and storage class and evolves frequently. At the time of writing, prices are $0.026 per GB-month for Standard, $0.01 for Nearline, and as low as $0.007 for Coldline storage in a multi-regional location. See the pricing page for up-to-date prices. See also Google Cloud Storage on a shoestring budget for an interesting cost breakdown.

A distinct trait of Google Storage's structure is that folders and subfolders within a bucket are not associated with a “physical” structure as they would be on your local machine. On Google Storage, buckets have virtual folders: the full path to a file is interpreted as being the entire filename. Consider for instance the file hello_world.txt located in mybucket/myfolder/. The file's URL is gs://mybucket/myfolder/hello_world.txt, and Google Storage interprets that file as having the filename myfolder/hello_world.txt.

The slash / character is part of the object filename instead of being an indication of an existing folder. As Google puts it, this object naming scheme creates “the illusion of a hierarchical file tree atop the ‘flat’ name space”.

Although this is transparent most of the time, virtual paths may result in misplaced files when uploading a folder with multiple subfolders. If the upload fails and needs to be restarted, the copy command can have unexpected results, since the folder did not exist during the first upload but does during the second try.

To avoid these weird cases, the best practice is to start by creating the expected folder structure and only then upload the files to their target folders.

gsutil

Gsutil is the command line tool used to manage buckets and objects on Google Storage. It is distributed as part of the gcloud command line tools. Gsutil is fully open source on GitHub and under active development.

Gsutil goes well beyond simple file transfers, with an impressive list of advanced features, including:

  • ACLs: setting access control via Access Control Lists
  • rsync: synchronizing folders and buckets
  • lifecycle: defining lifecycle rules
  • signed URLs: setting time-limited online access
  • perfdiag for troubleshooting
  • logging, notifications and versioning

Before diving into these powerful functionalities, let’s walk through a simple case of file transfer. If you don’t have gsutil installed on your local machine or cloud instance, follow the Google Cloud SDK install instructions for your OS in order to get started. You may need to sign up for a free trial account.

Getting around with gsutil

In the following examples, I create a bucket, upload some files, get information on these files, move them around and change the bucket storage class.

  • First things first. In order to get help on gsutil or any gsutil subcommand:
$ gsutil help
$ gsutil <command> help
  • Now create a bucket named <bucketname>

All bucket names share a single global Google namespace, so the name you choose must not already be taken.

$ gsutil mb gs://<bucketname>

Note that there are certain restrictions on bucket naming and creation beyond the uniqueness condition. For instance, you cannot change the name of an existing bucket, and a bucket name cannot include the word google.
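If you want to control where the bucket lives and which storage class it uses, mb accepts the -c and -l flags. The class and region values below are purely illustrative:

$ gsutil mb -c nearline -l us-east1 gs://<bucketname>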

  • Upload and download a file with cp
$ gsutil cp <local_file> gs://<bucketname>/
$ gsutil cp  gs://<bucketname>/<remote_file> ./
  • And transfer a file between buckets:
$ gsutil cp  gs://<bucket_A>/<remote_file> gs://<bucket_B>/
  • Create a folder in a bucket with cp (folders are virtual, so copying a local folder requires the -r flag discussed below)
$ gsutil cp -r <new_folder> gs://<bucketname>/
  • Upload a file to <new_folder> with cp
$ gsutil cp <local_file> gs://<bucketname>/<new_folder>/

This will create the folder <new_folder> and at the same time upload the file <local_file> to that folder. Note the trailing / that tells gsutil to interpret <new_folder> as a folder and not as the target filename. If you omit the trailing /, gsutil will upload the file under the name <new_folder> and the new folder will not be created.
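For example, with a hypothetical local file report.txt, the two commands below behave differently:

$ gsutil cp report.txt gs://<bucketname>/archive/
$ gsutil cp report.txt gs://<bucketname>/archive

The first creates the virtual folder archive containing report.txt; the second creates a single object simply named archive.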

  • List the folder with ls
$ gsutil ls gs://<bucketname>/
  • Check storage space with du
$ gsutil du -h gs://<bucketname>/

where the -h flag makes the output human-readable.

The -r and -m flags

  • Copy a local folder and its content to a bucket with cp -r
$ gsutil cp -r ./<local_folder> gs://<bucketname>/

Consider for instance a local ./img directory that contains several image files. We can copy that entire local directory and create the remote folder at the same time with the following command:

$ gsutil cp -r ./img gs://<bucketname>/

The bucket now has the virtual folder /img.

  • Improve performance with the -m flag

When moving large numbers of files, adding the top-level -m flag (placed before the cp command) will run the transfers in parallel and significantly improve performance, provided you are using a reasonably fast network connection.
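For instance, to parallelize the folder upload from the previous example:

$ gsutil -m cp -r ./img gs://<bucketname>/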

  • Wildcards

gsutil supports the * and ? wildcards within a single folder level. To match objects across subfolders, double the asterisk: for instance, gsutil ls gs://<bucketname>/**.txt will list all the text files in all subdirectories. The wildcard page offers more details.
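As an illustration, assuming the img folder uploaded earlier, the single and double asterisk behave differently:

$ gsutil ls gs://<bucketname>/img/*.png
$ gsutil ls gs://<bucketname>/img/**

The first lists the PNG files directly under img, while the second lists every object under img at any depth.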

Gsutil full configuration

Gsutil's full configuration lives in the ~/.boto file. You can edit that file directly or via the gsutil config command. Some interesting parameters, illustrated in the sample configuration after this list, are:

  • parallel_composite_upload_threshold: to specify the maximum size of a file to be uploaded in a single stream; files larger than this threshold are uploaded in parallel chunks. This parameter is disabled by default.

  • check_hashes: to enforce integrity checks when downloading data, always, never or conditionally.
  • prefer_api: to specify which API (JSON or XML) gsutil should prefer when interacting with Google Cloud Storage
  • and aws_access_key_id and aws_secret_access_key for interoperability with S3.
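A minimal sketch of how these settings look in the ~/.boto file; the section names follow the file's [GSUtil] and [Credentials] groups, and the values shown are illustrative rather than recommendations:

[GSUtil]
parallel_composite_upload_threshold = 150M
check_hashes = if_fast_else_fail
prefer_api = json

[Credentials]
aws_access_key_id = <your_aws_access_key_id>
aws_secret_access_key = <your_aws_secret_access_key>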

Cloud storage compatibility is powerful: not only can you migrate easily from AWS S3 to GCP or vice versa, but you can also sync S3 buckets and GCP buckets with the rsync command.
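For example, assuming your AWS credentials are set in ~/.boto as above, a sync from an S3 bucket to a Google Storage bucket could look like:

$ gsutil -m rsync -r s3://<s3_bucketname> gs://<bucketname>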

ACL

As stated in the documentation, Access Control Lists (ACLs) allow you to control who can read and write your data, and who can read and write the ACLs themselves. ACLs are assigned to objects (files) or buckets. By default, all files in a bucket have the same ACL as the bucket they’re in.

The acl command has three subcommands:

  • get: lists the permissions on a given object or bucket. For instance, gsutil acl get gs://<bucketname>/ outputs the access settings for the <bucketname> bucket.
  • set: sets the permissions on a given object. The best way to set permissions and avoid mistakes is to first export them to a file with gsutil acl get gs://<bucketname>/<filename> > acl.txt, modify the acl.txt file, and then apply the new permissions with gsutil acl set acl.txt gs://<bucketname>/<filename>
  • ch: for change, modifies the current permissions on a given object. For instance, to grant WRITE access to a user: gsutil acl ch -u <user_email>:WRITE gs://<bucketname>/

The default settings for buckets are defined with the defacl command which also responds to get, set and ch subcommands. The command gsutil defacl get gs://<bucketname>/ will return the default settings for the bucket <bucketname>.

Several predefined settings are available (an example of applying one follows the list):

  • project-private: the default setting for new objects. It gives permissions to the project team based on their roles: all team members have READ permission, while editors and owners have OWNER permission.
  • private: gives the bucket or object owner OWNER permission, and nobody else any access.
  • public-read: Opens the objects to the whole internet as it gives all users read permission.
  • public-read-write: The dangerous setting that allows anyone on the internet to upload files to your bucket.
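For instance, to apply a predefined ACL to an object, or to make it the default for new objects in a bucket (be careful with the public settings):

$ gsutil acl set public-read gs://<bucketname>/<filename>
$ gsutil defacl set public-read gs://<bucketname>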

Further ACL details are available on the ACL page.

rsync

The gsutil rsync command makes the content of a target folder identical to the content of a source folder by copying, updating or deleting in the target folder any file that has changed in the source folder. This synchronization works between local folders and GCP buckets, as well as with other gsutil-compatible cloud storage solutions such as AWS S3. With the gsutil rsync command you have everything you need to create an automatic backup of your data in the cloud. The rsync command has the following form:

$ gsutil rsync <source folder> <target folder>

Consider a local folder ./myfolder and the bucket <bucketname>; the following command synchronizes the content of the local folder with the storage bucket:

$ gsutil -m rsync -r -d ./myfolder gs://<bucketname>

The content of gs://<bucketname> will match the content of your local ./myfolder directory, effectively backing up the local documents.

  • Note the presence of the -r flag which ensures that all subfolders are matched.
  • The -d flag is to be used with caution, as it deletes content from the target when it has been deleted from the source. If you inadvertently make a mistake in your command, for instance inverting the source and target folders, you may end up deleting your content. A good way to ensure that does not happen is to enable bucket versioning.

If you don’t want to run the gsutil command every time you make a change in the source folder, you can set up a cron job on your local machine with crontab -e, or the equivalent on Windows machines. For instance, the following cron job will back up your local folder to Google Cloud every 15 minutes.

*/15 * * * * gsutil -m rsync -r -d <full path to myfolder> gs://mybucket >> <full path to log file>

Bucket versioning

Bucket versioning is a powerful feature that protects against deleting files by mistake. Enabling and disabling versioning is done at the bucket level with the commands:

$ gsutil versioning set on gs://<bucketname>
$ gsutil versioning set off gs://<bucketname>
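You can check whether versioning is currently enabled with:

$ gsutil versioning get gs://<bucketname>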

When versioning is enabled on a bucket, objects become accessible by specifying their version number. Listing the content of a bucket with ls -a will show the version numbers of its objects, as such:

gs://<bucketname>/<filename>#<version_number_1>
gs://<bucketname>/<filename>#<version_number_2>
gs://<bucketname>/<filename>#<version_number_3>
...

To retrieve the correct version, simply append the version number to the object name in the cp command.
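For instance, to restore an older version of a file (the version number placeholder comes from the listing above):

$ gsutil cp gs://<bucketname>/<filename>#<version_number_1> ./<filename>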

The object versioning page offers more details on the subject.

Signed URLs

Signed URLs are a mechanism for query string authentication for buckets and objects. In other words, signed URLs provide a way to give time-limited read or write access to anyone in possession of the URL, regardless of whether they have a Google account. To create a signed URL you first need to generate a private key by following these instructions. Click on Create a service account key, select your project, and download the JSON file that contains your private key.

You can now create a signed URL for one of your files with:

$ gsutil signurl -d 10m -m GET <path/private_key.json>  gs://<bucketname>/<filename>

Note that signed URLs do not work on directories. If you want to give access to multiple files you can use wildcards. For instance the following command will give access for 10 minutes to all the PNG files in the gs://<bucketname>/img/ folder.

$ gsutil signurl -d 10m -m GET <path/private_key.json> gs://<bucketname>/img/*.png

Check the signed URLs page for more info.

Service accounts

Service accounts are special accounts that represent software rather than people. They are the most common way applications authenticate with Google Cloud Storage. Every project has service accounts associated with it, which may be used for different authentication scenarios, as well as to enable advanced features such as Signed URLs and browser uploads using POST.

When you use a service account to authenticate your application, you do not need a user to authenticate to get an access token. Instead, you obtain a private key from the Google Cloud Platform Console, which you then use to send a signed request for an access token. You can then use the access token like you normally would. For more information see the Google Cloud Platform Auth Guide.
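As a quick sketch, one common way to let gsutil authenticate as a service account is to activate the downloaded JSON key through gcloud (the key path is a placeholder):

$ gcloud auth activate-service-account --key-file=<path/private_key.json>
$ gsutil ls gs://<bucketname>/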

Lifecycle

Lifecycle configurations allow you to automatically delete objects or change their storage class when some criterion is met.

To enable lifecycle for a bucket with settings defined in the config_file.json file, run:

$ gsutil lifecycle set <config_file.json> gs://<bucket_name>
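You can check the configuration currently in effect with:

$ gsutil lifecycle get gs://<bucket_name>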

For instance, in order to delete the content of the bucket after 30 days, the config file would be:

{
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {
                    "age": 30,
                    "isLive": true
                }
            }
        ]
    }
}

While a rule changing the storage class of objects to Nearline after a year would be:

{
    "action": {
        "type": "SetStorageClass",
        "storageClass": "NEARLINE"
    },
    "condition": {
      "age": 365,
      "matchesStorageClass": ["MULTI_REGIONAL", "STANDARD", "DURABLE_REDUCED_AVAILABILITY"]
    }
}
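Note that to apply this second rule with gsutil lifecycle set, it needs to be wrapped in the same lifecycle/rule structure as the first example:

{
    "lifecycle": {
        "rule": [
            {
                "action": {
                    "type": "SetStorageClass",
                    "storageClass": "NEARLINE"
                },
                "condition": {
                    "age": 365,
                    "matchesStorageClass": ["MULTI_REGIONAL", "STANDARD", "DURABLE_REDUCED_AVAILABILITY"]
                }
            }
        ]
    }
}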

Check the lifecycle configurations page for more info.

Conclusion

Google Cloud Storage is a fully featured enterprise-level service which offers a viable alternative to AWS S3. Prices, scalability, and reliability are key features of the service. I’ve been using Google Storage for a while across different projects and find it very user friendly. It is definitely worth testing if you need to store significant amounts of data.

Further reading