How to build a data lake on Google Cloud Platform: Cloud Summit 2019
by Tristan Van Thielen, on Nov 6, 2019 7:30:00 AM
A few weeks ago I attended the Google Cloud Summit in Amsterdam. It was my first Google event and I had a very cool experience together with my Fourcast colleagues. The day was filled with talks on interesting topics as well as stories from Google Cloud customers. The goal of this event was to learn about new Google technologies and to get inspired by the experiences of others. One of the talks that I found most interesting was “Building a data lake on Google Cloud Platform”. When I got home I started doing some digging on the internet and I will share what I found about how to build a data lake on Google Cloud Platform (GCP) below.
Note that this is by no means a thorough guide on how to build a data lake on Google Cloud Platform, but it should give you an idea.
Data lakes, data warehouses and everything in between
If you have been keeping up with everything about data lately, then you will surely have heard terms like ‘data lake’, ‘data warehouse’, ‘data mart’, etc.
But what exactly is the difference between them? Well, there is some discussion around this topic, so I will share what Google has to say about this.
What is a data warehouse?
A data warehouse is a collection of all of the structured data that a company has in one place. Why is this useful? Having all your data in one place means that it is easy to combine.
Data is no longer in separate silos and the time to insight (TTI) is decreased. The goal is not only to have all your data in one place, but also to have a clear view of the lineage of your data.
This way you know exactly where your data comes from and you can validate the results from an analysis by verifying that the data is correct.
What is a data mart?
A data mart is a subset of a data warehouse. The goal is to have all the information related to one subject in one place. This makes it easy for end users to use the data and allows them to work on a small subset, thereby limiting the resources used.
Additionally, this enhances security as you’re exposing as little information as needed for each specific use case.
What is a data lake?
The benefits of having a data warehouse and data marts are clear. So why do you need a data lake? What is a data lake, even?
Well, a data lake is similar to a data warehouse, but it is broader. The two can (and should) be used alongside each other.
A data lake holds any information you can store: unstructured data like reports, pictures and text files, as well as the structured data that you would normally put directly into your data warehouse. The idea is that data that does not seem useful now might well be useful in a few months. Storage in the cloud is cheap, so why would you want to miss that opportunity?
It is important to note that data in your data lake should be raw and unprocessed. The refinement is done when presenting the data to the different serving layers (like a data warehouse) that your business has. While some of the serving layers might change over time, the existing information in the data lake remains untouched. The main idea is to add rather than overwrite or change.
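To make the "add rather than overwrite" idea a bit more tangible, here is a minimal sketch of how raw objects could be named so that a new version of a file never replaces an older one. The `raw/<source>/ingest_date=<day>/` layout and the function name are my own assumptions for illustration, not something Google prescribes.

```python
import hashlib
from datetime import datetime, timezone


def raw_object_name(source: str, filename: str, payload: bytes,
                    ingested_at: datetime) -> str:
    """Build an object name for a raw file in the data lake.

    The name contains the ingest date and a short hash of the content,
    so a changed file gets a new name instead of overwriting the old one.
    (Hypothetical naming scheme, chosen for this example only.)
    """
    day = ingested_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"raw/{source}/ingest_date={day}/{digest}_{filename}"
```

Two ingests of the same file name with different content end up as two separate objects, while re-ingesting identical content on the same day maps to the same name.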
Image: all rights reserved to Holistics.io at https://www.holistics.io/blog/data-lake-vs-data-warehouse-vs-data-mart/
With all this data you can do a lot of things. For example, instead of importing your structured data directly into your data warehouse, you first store it in your data lake, then enrich it with the information you can gain from the unstructured data, and only then load the result into your data warehouse.
You can read an example of how to do this further below.
Why a data lake?
This is of course just one example. Think about what you could do in your company if you had all your information in one place! As some of you may have heard by now, a commonly quoted estimate is that around 80% of all data is unstructured. As you can guess, a lot of useful information can be hidden in this data.
Why did we not do this before? Well, unstructured data is exactly what it sounds like: not well structured. This means that it has always been hard to analyse. Plain text used to have to go through manual analysis before any useful information could be gained. This was both time-consuming and expensive.
However, now there are many machine learning techniques that can analyse plain text as well as images, videos etc. This means that extracting the information can be automated. Incorporating this extra information can heavily impact the efficiency of your organisation and can improve customer experience.
How to build your own data lake?
Now that we are clear on the different terms and we know what we can gain, let’s talk about how you can get a data lake on Google Cloud Platform (GCP). As you can see in the diagram below, the ideal service to start your data lake is Google Cloud Storage.
Cloud Storage offers unlimited object storage for files of up to 5 TB apiece. It has different storage classes, so you can store your rarely accessed information more cheaply, and it has good SLAs with regard to durability and availability. Cloud Storage integrates nicely with all other GCP services, so you can leverage them from the moment your data arrives.
In case you are not ready to make the full switch to cloud native, Cloud Storage can even be used to replace your HDFS (in fact, HDFS was modelled on Google's white paper on GFS). This means that your applications will not have to be drastically changed to work with your data lake on Cloud Storage.
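As an illustration, this is roughly what landing a raw file in Cloud Storage could look like with the `google-cloud-storage` Python client. Treat it as a sketch: the thresholds in `pick_storage_class` are a rule of thumb I chose myself (they mirror the 30, 90 and 365 day minimum storage durations of the colder classes), and the bucket, path and function names are made up.

```python
def pick_storage_class(days_between_reads: int) -> str:
    """Pick a storage class from how often the data is expected to be read.

    The thresholds are an assumption for this example, not an official rule.
    """
    if days_between_reads < 30:
        return "STANDARD"
    if days_between_reads < 90:
        return "NEARLINE"
    if days_between_reads < 365:
        return "COLDLINE"
    return "ARCHIVE"


def upload_raw_file(bucket_name: str, local_path: str, object_name: str,
                    days_between_reads: int) -> None:
    """Upload one local file to the data lake bucket without overwriting."""
    # Third-party dependency: pip install google-cloud-storage
    from google.cloud import storage

    client = storage.Client()  # uses your default GCP credentials
    blob = client.bucket(bucket_name).blob(object_name)
    # The storage class of an object can be set when it is created.
    blob.storage_class = pick_storage_class(days_between_reads)
    # if_generation_match=0 makes the upload fail if the object already
    # exists, so raw data is only ever added, never replaced.
    blob.upload_from_filename(local_path, if_generation_match=0)
```

A call could look like `upload_raw_file("my-data-lake", "reviews.json", "raw/reviews/reviews.json", days_between_reads=120)`, which would store the file in the Coldline class.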
So what are the main steps to get the benefits of a data lake, as explained above?
- Gather all of your raw data in Cloud Storage
- Use tools like the messaging service Cloud Pub/Sub or Apache Beam pipelines in Cloud Dataflow to extract, transform and load (ETL) data from Cloud Storage to a different serving layer (or back to Cloud Storage). Leverage the fact that you can access all the information of your company in one place and think creatively about what you can do with your unstructured data.
- Leverage this data in your serving layers. An example of such a serving layer could be a data mart. Present the data in the data mart to its end users and allow them to gain better insights with tools like Data Studio and Looker. Use this to improve your customer experience, increase profit and make your organisation more efficient.
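The steps above could be sketched as a small Apache Beam pipeline that reads raw JSON lines from Cloud Storage, flattens them and appends them to a BigQuery table. This is only an outline under my own assumptions: the field names (`product_id`, `stars`, `comment`), the schema and the `run` arguments are hypothetical, and a real pipeline would need error handling for malformed records.

```python
import json


def parse_review(line: str) -> dict:
    """Turn one raw JSON line into a flat row for the serving layer."""
    record = json.loads(line)
    return {
        "product_id": str(record["product_id"]),
        "stars": int(record["stars"]),
        "comment": (record.get("comment") or "").strip(),
    }


def run(input_pattern: str, output_table: str, pipeline_args=None) -> None:
    """Read raw files, transform them and load them into BigQuery."""
    # Third-party dependency: pip install "apache-beam[gcp]"
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadRaw" >> beam.io.ReadFromText(input_pattern)
            | "Parse" >> beam.Map(parse_review)
            | "WriteServing" >> beam.io.WriteToBigQuery(
                output_table,
                schema="product_id:STRING,stars:INTEGER,comment:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```

Passing `--runner=DataflowRunner` together with a project, region and temp location in `pipeline_args` would run the same code on Cloud Dataflow; without it, Beam runs the pipeline locally. Note that the raw files in Cloud Storage are only read, never modified.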
A concrete data lake use case example on Google Cloud Platform
Let’s get into a more concrete example for a fictional business called Rainforest. Rainforest has an online store with a lot of products. They have a large number of concurrent users from all over the globe. To ensure that they do not sell more than they have in stock, they use a global transactional SQL-based database (like Cloud Spanner).
Users who have bought an item can leave a review in the form of a score and plain text commentary. Right now, Rainforest has a table in their data warehouse (let’s say in BigQuery) that has all their products and associated scores. They are not using the text associated with that score.
Rainforest has noticed that users do not always give a score that matches their comment. For example, a user might say: “Best product I’ve ever bought” and still only rate the product one star out of five.
Besides that, some customers also send text feedback via mail. Because of this, Rainforest has decided to level up their game. They want to start a test project for a data lake, so they dump all their review data (comments and emails) in Cloud Storage. They use a Dataflow pipeline to extract the text and call the Natural Language API provided by Google to get the sentiment for the review. They then add this sentiment together with the other information about the purchase into their data warehouse in BigQuery.
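The sentiment step of such a pipeline might look something like the sketch below, which uses the `google-cloud-language` Python client. The `build_enriched_row` helper and its field names are hypothetical; they only show how the sentiment could be joined with the purchase information before it is written to BigQuery.

```python
def analyze_review_sentiment(text: str) -> tuple:
    """Ask the Natural Language API for the sentiment of one review.

    Returns (score, magnitude): score runs from -1.0 (negative) to
    1.0 (positive), magnitude is the overall strength of emotion.
    """
    # Third-party dependency: pip install google-cloud-language
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.analyze_sentiment(request={"document": document})
    sentiment = response.document_sentiment
    return sentiment.score, sentiment.magnitude


def build_enriched_row(purchase: dict, score: float, magnitude: float) -> dict:
    """Combine purchase information with the sentiment of its review.

    The input dictionary is copied, so the original record stays untouched.
    (Hypothetical row layout, for illustration only.)
    """
    row = dict(purchase)
    row["sentiment_score"] = score
    row["sentiment_magnitude"] = magnitude
    return row
```

Inside a Dataflow pipeline, these two functions would typically be called from a `Map` or `DoFn` step, once per review.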
Rainforest no longer needs to rely solely on the number of stars a product has received to make a recommendation. Instead, they can use the sentiment of a review to create a new score that will be used by their recommendation engine. This gives them an edge over their competitors and drives their business forward.
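How such a new score is calculated is up to Rainforest. One simple possibility, purely as an assumption on my side, is to map the sentiment score (which runs from -1 to 1) onto the same 1 to 5 scale as the stars and take a weighted average. The same mapping can flag reviews where the stars and the text clearly disagree.

```python
def sentiment_to_stars(sentiment_score: float) -> float:
    """Map a sentiment score in [-1, 1] onto the 1-5 star scale."""
    if not -1.0 <= sentiment_score <= 1.0:
        raise ValueError("sentiment_score must be between -1 and 1")
    return 3.0 + 2.0 * sentiment_score


def blended_score(stars: int, sentiment_score: float,
                  star_weight: float = 0.5) -> float:
    """Weighted average of the star rating and the text sentiment.

    The formula and the default weight of 0.5 are illustrative choices.
    """
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if not 0.0 <= star_weight <= 1.0:
        raise ValueError("star_weight must be between 0 and 1")
    text_stars = sentiment_to_stars(sentiment_score)
    return star_weight * stars + (1.0 - star_weight) * text_stars


def stars_disagree_with_text(stars: int, sentiment_score: float,
                             threshold: float = 2.0) -> bool:
    """Flag reviews whose stars and text are at least `threshold` stars apart."""
    return abs(stars - sentiment_to_stars(sentiment_score)) >= threshold
```

For the "Best product I’ve ever bought" review with one star, a strongly positive sentiment would pull the blended score up towards the middle of the scale, and the review would be flagged as a mismatch.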
This was just a small example of what you can do with a data lake. Of course, the exact implementation will vary from use case to use case, but the principle remains the same. Store all of your information in one place in its raw form, combine it in useful ways and use these to improve your business. Your company probably has a lot of data that it isn’t leveraging right now, so I hope this blog has inspired you in the same way that I was by the talk I attended at the Google Cloud Summit in Amsterdam. Start thinking about what you can do with your data and drive your business forward!