Monday, August 1, 2011

ETL Engines: VLDW & Loading / Transforming

You can read this good article from Dan E. Linstedt:
I hope you've enjoyed this series; I've not received any comments either way. I'll be finishing up this series soon. In this entry I'll address "ETL" processing in general; in another entry I'll discuss "ELT" in general; then I'll begin to discuss BI queries and query engines going forward; and finally, at the end of this series, I'll bring in a couple of "appliances" or "appliance-like" performance enhancers.


In this entry, I'm going to focus on super fast, high speed ETL. This entry is less about the tools, and more about the architectures that work. I hope you enjoy it.


ETL is frequently used to move huge batches of data around; in fact, it's now an "aging" technology, but has proven itself time and time again as a necessity for the following reasons:
a) Managing metadata: technical metadata, process flows, process scheduling, structured data sets, and more recently "semi-structured" data sets like XML and so on
b) GUI-driven partitioning and parallelism
c) GUI-driven grid setup / administration and maintenance
d) Data quality
e) Access to unstructured data, and conversion to a structured world
f) GUI-driven connections to heterogeneous sources and targets



When we look at this list, there's not only a lot of power available, there are also tremendous benefits to using the ETL tools on the market. Some of the tools I'm familiar with (I'm sure there are more) are, hopefully in alphabetical order:
* Ab-Initio
* Business Objects - Data Integrator
* Hummingbird - Genio
* IBM - DataStage
* Informatica - PowerCenter
* Microsoft - SSIS
* Oracle - Sunopsis (which we haven't heard from since the acquisition)
* Pentaho - Kettle
* Talend
* Teradata - Warehousing Utilities



Within these tool sets, there are a variety of options to choose from, but what is most important in VLDW or large data sets is the architecture. The _manner_ in which we construct our loads greatly impacts performance.


Too often, the industry or the vendor suggests that performance can be solved simply by throwing hardware at the problem (oh, if only you would buy grid, or buy this, or buy that...). That helps for a little bit, but ultimately doesn't solve the problem.


If your ETL routines (during peak operation) are not pushing the existing hardware beyond an average load of 60%, or are not sustaining a hardware load of 80%, then most likely the architectures of the individual "data flows" are not correct, especially for big data.


Now wait a minute! You're telling me that just because I can, doesn't mean I should?
Yep... Too many tools make it too easy to "add another source, add another target, mix inserts with updates with deletes" all in the same data flow. This adds unwanted and unneeded complexity. Any time complexity enters, performance leaves.



I've blogged about this before... If you're headed down a highway towards a brick wall, and you refuse to change direction, what will happen? You'll hit the brick wall.


If you've got performance problems, and you refuse to change the architecture to try new things, you'll still have performance problems. There's only so much tweaking of knobs that can help performance; after that, it's all up to the architecture.


With Big Data & ETL, the first thing to realize is the mathematics behind the arrival of the data sets. The mistakes that ETL designers make are as follows (these get more painful the larger the data set and/or the lower the arrival latency of the data):
1) Incorporating inserts, updates, and deletes into the _same_ data flow / same process (see the sketch after this list)
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
3) Targeting more than 1 or 2 target tables
4) moving rows that are too wide through a single process
5) loading very large data sets to targets WITH INDEXES ON
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false; multi-passing the data can actually improve performance by orders of magnitude IF parallelism can be increased.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
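
To illustrate the first mistake above, here is a minimal sketch (my own illustration, not from the article) of splitting a mixed change set into three separate flows, so each flow can be bulk-bound and run in parallel instead of interleaving operations row by row. The record layout and the "op" field are assumptions for the example.

```python
# Hedged sketch: route a mixed change set into separate insert/update/delete
# flows so each flow can use block-style bind arrays on its own.
# The record layout and the "op" field are illustrative assumptions.
from collections import defaultdict

def split_by_operation(change_records):
    """change_records: iterable of dicts like {"op": "I", "key": 1, "payload": {...}}."""
    flows = defaultdict(list)
    for record in change_records:
        flows[record["op"]].append(record)  # "I", "U", or "D"
    # Each list now feeds its own dedicated data flow.
    return flows["I"], flows["U"], flows["D"]

inserts, updates, deletes = split_by_operation([
    {"op": "I", "key": 1, "payload": {"amount": 10}},
    {"op": "U", "key": 2, "payload": {"amount": 20}},
    {"op": "D", "key": 3, "payload": {}},
])
# Run the insert flow as a bulk load (indexes off), then the updates, then the
# deletes, rather than mixing all three operations in one process.
```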



Oh, and did I mention running on a 32 bit environment which causes significant OS Swapping to occur?
Let me explain this one. Any single process started within a 32 bit environment is limited (immediately) to 2GB of addressable RAM. Now, take away the RAM needed for the threads or the executable engine code (usually around 300MB or more), leaving us with 1.5GB to 1.7GB of RAM to work with. Then, take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have...
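
To make that arithmetic concrete, here is a small back-of-the-envelope sketch; the overhead and cache figures are illustrative assumptions, not measurements from any particular engine.

```python
# Rough 32-bit address-space budget for a single ETL process.
# All figures below are illustrative assumptions.
ADDRESSABLE_RAM_MB = 2048   # 32-bit per-process limit (~2GB)
engine_overhead_mb = 300    # threads + engine executable code
cache_objects = 10          # lookup / reference caches in the data flow
cache_size_mb = 150         # assumed average size of each cache object

remaining_mb = ADDRESSABLE_RAM_MB - engine_overhead_mb - cache_objects * cache_size_mb
print(f"RAM left for data reading/writing buffers: {remaining_mb} MB")
# With these assumptions: 2048 - 300 - 1500 = 248 MB, roughly the
# "possibly 200MB" described below, and it only drops from there.
```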



Now, you are left with possibly 200MB, maybe 100MB, of RAM to allocate for "data reading/data writing"... and it drops from there.
Let me explain this other piece as well: the Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY!!! So there are additional limitations there. The Windows 32 bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, AND BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 bit RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put the other 1/2 in the pagefile.sys swap area.
By the way, here's a tidbit for you that you have to try to believe.
If you are running Windows 32 bit and SQLServer (either version):
1. Change your ETL "block size" to 8k to match the database (4k, I think, in SQLServer2000)
2. Change your row size to fit as many rows as possible into an 8k block; the more rows per block, the faster the performance.
3. Change your commit point to 8,500 rows (this is the sweet spot; see the connection sketch after this list)
4. Check in on NETWORK PACKET SIZES between the ETL engine and the database, and increase them from 2k/4k to 8k, again to match the disk.
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except SQLServer2005 direct CT-LIB). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow, slow; DB-LIB is 12-year-old technology and is NOT multi-threaded, nor parallel, whereas CT-LIB is.
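
As one way to apply the batching advice above, here is a minimal sketch using Python and pyodbc; the driver name, server, database, table, and column names are placeholders, and the tool choice itself is my assumption rather than anything from the article. The commit interval mirrors the 8,500-row sweet spot above; network packet size is usually configured on the driver or server side rather than in code like this.

```python
# Hedged sketch: batch inserts over ODBC with a fixed commit interval.
# Driver name, server, database, table, and columns are placeholders.
import pyodbc

COMMIT_INTERVAL = 8500  # the "sweet spot" commit point from the list above

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=my_server;DATABASE=my_dw;Trusted_Connection=yes;",
    autocommit=False,
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter arrays instead of row-by-row calls

def load_batches(rows):
    """rows: an iterable of (sale_id, amount) tuples from the upstream flow."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= COMMIT_INTERVAL:
            cursor.executemany(
                "INSERT INTO stage_sales (sale_id, amount) VALUES (?, ?)", batch
            )
            conn.commit()  # commit every ~8,500 rows
            batch.clear()
    if batch:  # flush the remainder
        cursor.executemany(
            "INSERT INTO stage_sales (sale_id, amount) VALUES (?, ?)", batch
        )
        conn.commit()
```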



Windows 64 Bit does NOT have these limitations; it requires roughly 30% more hardware to run "cool" and efficiently, but it is FAST when installed and configured correctly on the right hardware.
Linux, and Unix do not have these limitations either.


So, architecture is everything. If you take the list I've published of what NOT to do, and change your architecture accordingly, you should see significant performance gains when running "the tuned data flow" by itself (as a test).


There's one more piece I'd like to share: why letting the DB handle errors slows down the processing.


Most ETL engines handle "blocks" of rows, which is why, if you think transactionally or architect transactionally (row by row), performance will NEVER reach its maximum. Let's say the ETL engine puts 150 rows in a block, and sends this block to the database. The database then caches this block until it reaches a commit point (in general).


The database is then "asked" to commit the rows it has cached (in TEMP, mind you). The database then has to iterate all the rows in each block (using an optimized bind array) and bounce the "rows in error" ONE AT A TIME back to the ETL engine. Now I ask you: is this transactional or "batch" oriented thinking? Right! Transactional... Of course it's going to be slow. Now, the other piece that happens is the ETL engine has to STOP PROCESSING NEW ROWS, pick up the "rows in error" one at a time, and flow them out to disk.
We've just quadrupled the I/O for that row and decreased our performance by a factor of 4x (at least). Why? The row traveled from the ETL buffer to the network, over the network to the database, was cached in TEMP in the database, then was read from TEMP (possibly disk), checked against the real table (again disk), and bounced back over the network. The ETL engine at this point STOPS its parallel processing of all kinds, just to handle the row in error, one at a time, and write it to local disk.
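
One common way around this penalty (a hedged sketch of my own, not something prescribed in the article) is to validate rows in the ETL layer and divert rejects to a local file before the block ever reaches the database, so the engine never stops to field bounced rows one at a time. The validation rule, record layout, and file name are assumptions for illustration.

```python
# Hedged sketch: pre-validate rows in the ETL layer and write rejects locally,
# instead of letting the database bounce error rows back one at a time.
# The validation rule, record layout, and file name are illustrative assumptions.
import csv

def split_rejects(rows, reject_path="rejects.csv"):
    """Yield clean rows for the bulk loader; spill bad rows to a local file."""
    with open(reject_path, "w", newline="") as reject_file:
        writer = csv.writer(reject_file)
        for sale_id, amount in rows:
            # Example rule: amount must be present and non-negative.
            if amount is None or amount < 0:
                writer.writerow([sale_id, amount])  # reject stays on local disk
            else:
                yield (sale_id, amount)             # clean row flows to the target
```

The clean stream can then feed a batched loader like the earlier sketch, while the rejects file is reviewed out of band.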


This same process is repeated whether we load directly from ETL or we load from a database loader. The process is worse when we mix inserts with updates with deletes. Why? Because the database can no longer use block-style bind arrays; it now must process each row in its own individual statement (transactionally based).


So already you've lost 4x to 10x the performance you COULD be achieving, which means that a "data flow" that COULD execute at 80,000 rows per second now executes at 8,000 rows per second. A significant price to pay when dealing with large data sets.


The math:
800 million rows / 80,000 rows per second = 2.77 hours (about)
800 million rows / 8,000 rows per second = 27.77 hours (about)



A huge performance drain. Again, performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines.
Thoughts? Comments? I'd love to hear from you.

Cheers,
Dan L

