Introducing Matillion Data Loader to load data from GCP into SnowflakeAs you might know I am managing the Data & Analytics unit at Pong. One of my responsibilities is setting the direction of this unit. One of our focus points is Modern Cloud Analytics. Therefor we have setup a few partnerships focussed on the Snowflake Eco-system. You can find out more about these partnerships; here. Unfortunately only in Dutch, but Google Translate might help a bit. Matillion is one of those partners from the Snowflake Eco-system.
A few weeks ago I found out that Matillion is launching Matillion Data Loader (MDL); a free Software-as-a-Service (SaaS) Data Integration solution.
“Code-free, Wizard-based Pipeline builder”
With Matillion Data Loader you can:
- Easily and quickly load data into a Cloud Data Warehouse (Snowflake, Google Big Query and Amazon Redshift) using a Code-free, Wizard-based Pipeline builder,
- Using Scale & Speed capabilities in a Serverless environment,
- Monitoring jobs from a central dashboard,
- Upgrade to Matillion ETL for Advanced Data Transformation, Data Integration, and Job Orchestration.
I was one of the lucky ones to receive an invite to play around with Matillion Data Loader. Starting was relatively easy;
- Sign-up and activate my account,
- Start building Data Pipelines
- Start loading Data into Snowflake.
From Source to Cloud Data Warehouse in a few clicks
In this blog I will show you my first steps on the Matillion Data Loader path. I will use MDL to load data from an Excel-sheet into Snowflake (SF). The Excel-sheet is located in a Google Cloud Platform (GCP) bucket.
GCP Cloud Credentials
Matillion Data Loader must be able to select the Excel-sheet from the GCP-bucket. There for it is necessary to specify the Google Cloud Credentials. These can be downloaded from GCP. Follow the steps in the; Getting Started with Authentication and download the JSON-file to your local environment. In MDL you can manage the GCP Cloud Credentials and upload the JSON-file.
Snowflake Destination Configuration
To be able to load data into Snowflake, you need to make sure that MDL can connect to Snowflake with the correct credentials with enough privileges (e.g. ‘CREATE VIEW’-privileges in SF).
The two endpoints are in place to setup. There is a GCP-bucket where the Excel-file lives and where we have access to. At the other end, there is a Snowflake Database with a Schema where we are allowed to create a table and load data into. What’s left is creating an MDL Pipeline
Matillion Data Loader Pipeline
We need to create a Data Pipeline to get started. For the purpose of this blog, we choose an Excel-sheet as the Source. Snowflake will be the Target Destination Warehouse.
Add the Pipeline
We select Snowflake as the Destination Warehouse. Therefore we can select the Destination Configuration we created earlier. The Pipeline gets a meaningful name.
Google Cloud Storage
In a next step we need to provide the URI of the GCP-Bucket. . When you copy the URI it seems to include a space at the beginning. If you do not remove this space when you enter the Connection URL in Matillion Data Loader, you will get an error; “That URL is not a valid GS connection URL” . Copy the URI to the next step in the MDL Data Pipeline creation.
Google Cloud Connection Details
To be able to load the Excel-file from the Google Cloud Storage, we need to specify the Connection details. This means we select the Google Storage Type. After that we can select the GCP Credentials we created earlier. Finally we fill in the copied URI from the previous step. Make sure it has the right format and no spaces at the beginning.
Without specifying Connection Options, we go straight to te selection of the Data Sources. First click the ‘+’-sign and then select the appropriate sheet. Make sure you give the sheet in Excel a meaningful name. This name will be used to name the tabel in Snowflake.
In the following step, you select the columns you want to add to the table in Snowflake. By default all columns are selected. Also the empty ones.
We arrived at the fifth step in the Data Pipeline setup. Here we are going to tell Snowflake which table we are going to create by specifying a prefix. Combined with the sheet name, this will be the Target Table name. On top of that we specify which Warehouse to use and in which Database Schema we are going to create the table and load the data.
Verify the results in Snowflake
If everything works out like planned the MDL Data Pipeline will run and load the data into Snowflake. We can head over there and verify the results. As we can see there is a STG_CHANNELS table in the DAANALYTICS-database in the STAGING-schema. The columns match the columns specified in the sheet. Also the data is exactly the same as in the sheet. Apart from to Matillion-specific metadata columns.
Creating a Data Pipeline in Matillion is relatively easy. It needs some preparation at both ends. After that it is a matter of selecting and configuring. Most things speak for itself. One downside of MDL is that you cannot reconfigure a Data Pipeline. If something is not working like expected, you need to start all over again. Luckily enough creating Data Pipelines in Matillion Data Loader is not that hard.
Thanks for reading.
Originally published at https://daanalytics.nl on February 29, 2020.