
Tips for choosing Data Load tools for your Data Warehouse…

Jonathan Prentice

Updated: Sep 30, 2024

When implementing a modern Data Warehouse like Snowflake, one of the early decisions you will need to make is “What Data Load (data pipeline) tool should I use?”. Data Load tools are the part of your solution that moves raw data from your source applications into your Data Warehouse. Below are a few tips from my own experience that you might find useful…




*Note: This topic relates to my own experience with Snowflake EDW implementations. This content may become outdated, as these tools change very quickly, particularly the SaaS tools. These views are my own opinions.


This blog assumes you will adopt the common ELT approach of using separate tools/processes for “extracting/loading raw data into Snowflake” and “transforming (modeling) your data”.


List out the applications (data sources) you want to extract data from, and evaluate which Data Load tools have compatible connectors.

This should be one of the first steps you take: determine which Data Load tools have connectors for the source systems you need to extract data from.


Be cautious about any data sources (applications) that are old versions or self-hosted.

Old application versions are often not supported by Data Load connectors. Some connectors only support the cloud variants of applications and not the on-premise (self-hosted) variants. So test, and check compatibility carefully.

What is supported today might not be supported tomorrow.

E.g. self-hosted JIRA is incompatible with most Data Load tools; typically only the JIRA Cloud variant is supported.


Watch out for tools with limited scheduling.

Tools like StitchData have limited scheduling functionality until you upgrade to a more expensive plan. This may not seem like a big deal, but it can cost you money if you don’t have the flexibility to skip hourly data syncs during non-work hours and are instead forced to run frequent syncs 24x7.

It can also make it difficult to align your transformation processes with your Data Load process.
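As a point of comparison, this is the kind of scheduling flexibility you ideally want. Below is a minimal sketch using a Snowflake task with a cron schedule (the task, warehouse, and procedure names are all made up) that only runs hourly during weekday business hours:

    -- Hypothetical Snowflake task: runs hourly, but only 07:00-19:00
    -- Mon-Fri, instead of a forced 24x7 hourly sync.
    CREATE TASK hourly_sync_window
      WAREHOUSE = load_wh
      SCHEDULE = 'USING CRON 0 7-19 * * 1-5 UTC'  -- 1-5 = Mon-Fri
    AS
      CALL sync_raw_orders();  -- placeholder for your sync/transform step

    ALTER TASK hourly_sync_window RESUME;  -- tasks are created suspended

If your Data Load tool can’t express a schedule like this, you end up paying for every off-hours sync it forces you to run.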


Use more than one Data Load tool – “always have a flexible plan B”.

There is a growing number of great tools on the market. Some are extremely easy to use and can be set up in minutes without writing any code. Others are code-based, can take weeks to set up and configure a custom framework, and come with a steep learning curve. They all have their pros and cons.

Data Load tools normally fall into two categories:

Category 1

  • Quick to set up.

  • GUI wizard-driven interface.

  • Zero code, zero development required.

  • Usually cloud-based SaaS (some also have deployable agents).

  • Very limited flexibility and options.

  • Fully managed solution. Source system upgrades and schema drift are usually automatically managed for you.

  • Examples: Hevo, Fivetran, StitchData, Dataddo.

Category 2

  • Longer to set up.

  • Requires coding and development effort to create a custom-configured data load pattern.

  • Sometimes SaaS, sometimes self-hosted.

  • Flexible and highly configurable.

  • Comes with a learning curve and requires a developer skillset to set up.

  • Examples: Azure Data Factory, AWS Glue.


Using two tools (one from each of these categories) is often a good approach.

I typically encourage people to use a Category 1 tool as much as possible (because it’s quick, easy, and low-maintenance), and use a Category 2 tool for any data sources that don’t work well with their Category 1 tool.

This strategy also gives you an alternative option if your Category 1 tool stops working or becomes incompatible with your data source.


E.g. if there is no connector for a source system in your Category 1 tool, then you will probably need to use a Category 2 tool and write some custom code to pull the data out using a more complex method, such as interacting with an API or querying a SQL database directly.


Don’t underestimate the impact of one of these tools not working on an established data warehouse; they are a critical point of failure. So it’s always good to have a plan B.


Make it easy to change Data Load tools.

Do what you can to make your user-facing data warehouse tables (and your transformation tool) as independent as possible from the “raw data” dumped into your data warehouse tables by your Data Load tool. This makes it easier to change Data Load tools in the future if you need to.
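One common way to achieve this (a minimal sketch; the schema, table, and column names are made up) is to put a thin view layer over the raw landing tables, so downstream models only ever reference the views:

    -- Hypothetical decoupling layer: downstream models reference this
    -- view, never the raw landing table the Data Load tool writes to.
    CREATE OR REPLACE VIEW staging.stg_customers AS
    SELECT
        id         AS customer_id,
        name       AS customer_name,
        updated_at AS last_updated_at
    FROM raw_fivetran.customers;  -- only this FROM clause changes if you swap tools

If you switch tools later, you re-point these views at the new landing tables and your transformation layer stays untouched.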


A lot of these tools consume more data warehouse resources than expected to ingest data.

Most Data Load tools don’t take advantage of Snowpipe; they typically use more resource-intensive “COPY INTO” style data loading from an external/internal stage. A lot of these quick-to-get-going Category 1 tools are also quite chatty with the EDW: they interrogate your EDW tables to confirm what needs to be updated, check whether the schemas are in sync, etc., which results in a costly process to get data synced.
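For reference, the warehouse-driven pattern many of these tools run looks roughly like this (a hedged sketch with made-up stage and table names; assumes a landing table with a single VARIANT column):

    -- Each sync wakes a warehouse and runs something like:
    COPY INTO raw.jira_issues           -- hypothetical VARIANT landing table
    FROM @raw.jira_stage/issues/        -- hypothetical internal/external stage
    FILE_FORMAT = (TYPE = 'JSON')
    ON_ERROR = 'ABORT_STATEMENT';
    -- ...plus metadata checks (information_schema lookups, SHOW commands)
    -- before and after each batch, all of which consume warehouse time.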

But often it’s worth accepting a little extra Snowflake cost to enjoy an effortless, reliable data pipeline that manages itself, especially in the early stages when you’re under pressure to get value from your EDW project quickly.


Trial different tools to get a good idea of cost.

Cost can be difficult to estimate. E.g. Fivetran has a mysterious (non-linear, record-count-based) pricing scheme that makes it almost impossible to predict. Other tools charge based on how many records get synced, which can also be difficult to predict without intimate knowledge of your source systems. Most Category 1 Data Load tools are quick and easy to set up and have free trial periods, so I highly recommend trialing them where possible so you know for sure.


Don’t overlook your Snowflake costs.

Don’t forget to factor in the Snowflake (ingestion) costs, as well as your Data Load tool costs, to get the full picture. Both of these costs can vary wildly depending on what tools you are using and how you are using them.
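If your loads run on a dedicated warehouse, one way to keep an eye on the Snowflake side of the bill is to query the account usage views (the warehouse name here is made up):

    -- Credits burned per day by a (hypothetical) dedicated load warehouse.
    SELECT
        DATE_TRUNC('day', start_time) AS usage_day,
        SUM(credits_used)             AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE warehouse_name = 'LOAD_WH'
    GROUP BY 1
    ORDER BY 1 DESC;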


Load frequency can have a larger-than-expected effect on cost.

Increasing the frequency of data loads (e.g. from daily to hourly) can significantly increase the total record count processed by these data loads. This is because the same record might change multiple times throughout the day, so it could be processed multiple times (instead of once a day). But this all depends on what type of data you’re dealing with. So it helps to know the data in your source systems, and to understand how (and when) the users are updating it.
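If your source system exposes a last-modified timestamp, you can estimate this churn before committing to a schedule (a rough sketch; the table and column names are made up):

    -- For one sample day: distinct records changed vs. distinct
    -- (record, hour) pairs. If hourly_rows is much larger than
    -- daily_rows, hourly syncs will reprocess the same records often.
    SELECT
        COUNT(DISTINCT order_id) AS daily_rows,
        COUNT(DISTINCT order_id::STRING || '-' ||
              DATE_TRUNC('hour', updated_at)::STRING) AS hourly_rows
    FROM source_db.orders
    WHERE updated_at::DATE = '2024-09-01';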


Batch processes in your source systems can blow out your costs.

Be aware of batch processes updating your source system’s data, or any frequent large-scale bulk updates.

E.g. if you have a batch process that updates customer balances in your ERP every hour, that could potentially trigger a full resync of every customer record every hour.
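A quick way to spot this kind of pattern is to look for hourly update spikes in the source (again, hypothetical table and column names):

    -- Hours with unusually high update counts often indicate a batch
    -- job touching every row.
    SELECT
        DATE_TRUNC('hour', updated_at) AS update_hour,
        COUNT(*)                       AS rows_updated
    FROM source_db.customers
    GROUP BY 1
    ORDER BY rows_updated DESC
    LIMIT 20;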


One record in your source system is not always one record in your EDW landing table.

This can have a huge effect on monthly pricing.


E.g. Fivetran creates multiple EDW records for each JIRA “Issue” record (one record for each custom field).


Other products like StitchData only create one record for each JIRA Issue (StitchData stuffs all this data into a JSON column of a single record). If your Data Load tool’s pricing is based on the number of records created, and you have 100 custom JIRA fields, then your usage is going to be 100 times higher than expected!
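The trade-off with the single-JSON-column approach is that you unpack the fields yourself at transform time. In Snowflake that is straightforward; here is a sketch with made-up table and field names:

    -- Unpacking JIRA fields from a single VARIANT column in a
    -- hypothetical StitchData-style landing table.
    SELECT
        record:key::STRING                      AS issue_key,
        record:fields:summary::STRING           AS summary,
        record:fields:customfield_10010::STRING AS sprint_name
    FROM raw_stitch.jira_issues;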


What about getting raw data out of my non-prod (DEV/TEST) applications?

Loading raw data from non-prod environments will often cost the same as PROD environments (this applies to Data Load tool costs, as well as Snowflake ingestion costs). I recommend sharing a single set of PROD raw data EDW tables across all your dev/test/uat/qa data warehouse processes. This is the simplest/cheapest approach, and the one you should consider when starting out. If that doesn’t suit your business, then be prepared to double or triple up on usage costs.
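In Snowflake this is cheap to implement (hypothetical database and schema names below): point dev at the single set of PROD raw tables via views, or take a zero-copy clone for isolated experiments:

    -- Option A: dev views reading the one set of PROD raw tables
    -- (no data duplicated, no extra ingestion).
    CREATE OR REPLACE VIEW dev_db.raw.customers AS
    SELECT * FROM prod_db.raw.customers;

    -- Option B: a zero-copy clone; storage is only consumed as the
    -- clone diverges from PROD.
    CREATE DATABASE dev_db_snapshot CLONE prod_db;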


Full resyncing will be needed from time to time.

Fully resyncing a system’s data will result in a large increase in your monthly record consumption. So factor this in, and be prepared to do a full sync occasionally. It is inevitable that something will go wrong and you will need to do complete refreshes at some stage of your data warehouse journey.


Many Data Load tools offer a “free” initial load of data, but that is usually a one-time thing, so it won’t help you if you need to do a full reload in the future. Factor that into your planning.


Some Data Load tools like StitchData allow you to temporarily increase the record count of your plan for part of a month, which is very helpful for these scenarios.


Factor in any security requirements – SSO, etc.

If your workplace has other hard requirements like SSO, make that a requirement in your search for a Data Load tool. A lot of SaaS Data Load tools won’t support SSO unless you buy an enterprise plan, which might not be within budget for organizations starting out. So it’s really important to know your requirements when selecting your Data Load tools.


Security considerations for accessing on-premise data sources?

Exposing self-hosted on-premise applications externally to your cloud-based Data Load tool is often not an option for security-conscious organizations. That leaves you with two options:

  • Find a Data Load tool that has an on-premise agent/gateway, e.g. Azure Data Factory.

  • Find a Data Load tool you can run inside your network (this eliminates the ability to use cloud tools), e.g. Airbyte.


Avoid Data Load tools that have poor support.

Your Data Load tool is a critical piece of your EDW. If your Data Load tool stops working, then your EDW is out of date. So having reasonable support from the vendor gives good peace of mind.


Different tools will create different-looking EDW landing tables.

Just because two different Data Load tools are moving data from the same source application doesn’t mean the Snowflake landing tables will be identical.

E.g. Stitch and Fivetran create very different-looking EDW landing tables for Salesforce, JIRA, etc. So it pays to test these tools and see what type of tables they produce.
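A quick way to compare what two tools actually produced (made-up schema names below) is to diff the column metadata side by side:

    -- Column listing for the same source loaded by two different tools,
    -- to compare the landing table shapes.
    SELECT table_schema, table_name, column_name, data_type
    FROM my_edw.information_schema.columns
    WHERE table_schema IN ('RAW_STITCH', 'RAW_FIVETRAN')
      AND table_name ILIKE '%issue%'
    ORDER BY table_schema, table_name, ordinal_position;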





