Thursday, August 25, 2011

Environment Variables for The Parallel Engine

Source : publib.boulder.ibm.com


Set the listed environment variables depending on whether your environment meets the conditions stated in each variable.

Network settings


APT_IO_MAXIMUM_OUTSTANDING
If the system connects to multiple processing nodes through a network, set the APT_IO_MAXIMUM_OUTSTANDING environment variable to specify the amount of memory, in bytes, to reserve for the parallel engine on every node for TCP/IP communications. The default value is 2 MB.
If TCP/IP throughput at that setting is so low that there is idle processor time, increment it by doubling the setting until performance improves. If the system is paging, however, or if your job fails with messages about broken pipes or broken TCP connections, the setting is probably too high.
APT_RECVBUFSIZE
If any of the stages within a job has a large number of communication links between nodes, specify this environment variable with the TCP/IP buffer space that is allocated for each connection. Specify the value in bytes. The APT_SENDBUFSIZE and APT_RECVBUFSIZE values are the same. If you set one of these environment variables, the other is automatically set to the same value. These environment variables override the APT_IO_MAXIMUM_OUTSTANDING environment variable, which sets the total amount of TCP/IP buffer space that is used by one partition of a stage.
APT_SENDBUFSIZE
If any of the stages within a job has a large number of communication links between nodes, specify this environment variable with the TCP/IP buffer space that is allocated for each connection. Specify the value in bytes. The APT_SENDBUFSIZE and APT_RECVBUFSIZE values are the same. If you set one of these environment variables, the other is automatically set to the same value. These environment variables override the APT_IO_MAXIMUM_OUTSTANDING environment variable, which sets the total amount of TCP/IP buffer space that is used by one partition of a stage.
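
For example, on a UNIX or Linux engine these values are commonly exported in the dsenv file (or set per project through the Administrator client). The lines below are a minimal sketch only; the 4 MB and 1 MB figures are illustrative assumptions, not recommendations, and should be tuned as described above.

APT_IO_MAXIMUM_OUTSTANDING=4194304     # example: 4 MB of TCP/IP buffer space (default is 2 MB)
export APT_IO_MAXIMUM_OUTSTANDING

APT_SENDBUFSIZE=1048576                # example: 1 MB per connection; APT_RECVBUFSIZE is set to match automatically
export APT_SENDBUFSIZE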

Transform library

If you are working on a non-NFS MPP system, set the APT_COPY_TRANSFORM_OPERATOR environment variable to true to enable Transformer stages to work in this environment. IBM® WebSphere® DataStage® and QualityStage users must have the appropriate privileges to create project directory paths on all the remote nodes at runtime. This environment variable is set to false by default.

Job monitoring

By default, the job monitor uses time-based monitoring in the IBM WebSphere DataStage and QualityStage Director, and the job monitor window is updated every five seconds. You can also specify size-based monitoring, in which the job monitor window is updated after a specified number of new entries. To base monitoring on the number of new entries, set a value for the APT_MONITOR_SIZE environment variable. If you set the APT_MONITOR_TIME environment variable, it overrides any APT_MONITOR_SIZE setting.

Transport block size

To automatically calculate the block size for transferring data internally as jobs run, set the APT_AUTO_TRANSPORT_BLOCK_SIZE environment variable to True.
Alternatively, you can specify a specific block size. By default, the transport block size is 32768.

Detailed information about jobs

To produce detailed information about jobs as they run, set the APT_DUMP_SCORE value to True. By default, this environment variable is set to False.

C++ compiler

The environment variables APT_COMPILER and APT_LINKER are set at installation time to point to the default locations of the supported compilers. If your compiler is installed on a different computer from the parallel engine, you must change the default environment variables for every project by using the Administrator client.

Temporary directory

By default, the parallel engine uses the C:\tmp directory for some temporary file storage. If you do not want to use this directory, specify a different directory by setting the TMPDIR environment variable to its path.
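
As a quick reference, the other variables in this section can be set the same way. The following sketch uses assumed example values (the 5000-entry monitor size and the /data/dstmp scratch path are illustrations, not documented defaults):

APT_COPY_TRANSFORM_OPERATOR=true       # non-NFS MPP only: distribute Transformer libraries to remote nodes
APT_MONITOR_SIZE=5000                  # size-based job monitoring: update after 5000 new entries
APT_AUTO_TRANSPORT_BLOCK_SIZE=True     # let the engine calculate the transport block size
APT_DUMP_SCORE=True                    # write the parallel job score to the job log
TMPDIR=/data/dstmp                     # use a different temporary directory than the default
export APT_COPY_TRANSFORM_OPERATOR APT_MONITOR_SIZE APT_AUTO_TRANSPORT_BLOCK_SIZE APT_DUMP_SCORE TMPDIR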

DataStage logging changes introduced at 8.1 can impact performance and can also result in missing log entries

Abstract
DataStage logging changed at release 8.1: job run detail records are now logged to the operational repository (xmeta) rather than to the local project-level log files (RT_LOGxxx) used for each job in prior releases. As a result of this change we have seen the following issues:

Job run times can increase and Information Server client applications may not be as responsive depending on the amount of log data that is generated by the jobs.

In some cases no job log entries are written to the log as viewed via the DataStage Director client or Web Console, even though the jobs appear to run; in other cases job log detail entries are missing or do not show up until later.

Job log purge operations may run slowly or fail depending on the number of log entries.

Content

These issues can be worked around by reverting to the logging mechanism that was in place prior to release 8.1, by making the following project-level changes on the DataStage engine server.
Edit the project-level DSParams file (replace %projectName% with the actual project name), typically located in /opt/IBM/InformationServer/Server/Projects/%projectName% for Linux/UNIX and C:\IBM\InformationServer\Server\Projects\%projectName% for Windows, and modify the following 2 lines as shown to revert to pre-8.1 logging:

RTLogging=1
ORLogging=0
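
If you prefer to script this change on Linux/UNIX, something along the following lines works; it is only a sketch (the project path, the use of GNU sed -i, and the backup step are assumptions, not part of the technote), and the project should be idle while you edit DSParams:

PROJ=/opt/IBM/InformationServer/Server/Projects/myproject   # adjust to your project
cp "$PROJ/DSParams" "$PROJ/DSParams.bak"                    # keep a backup copy
sed -i -e 's/^RTLogging=.*/RTLogging=1/' -e 's/^ORLogging=.*/ORLogging=0/' "$PROJ/DSParams"
grep -E '^(RTLogging|ORLogging)=' "$PROJ/DSParams"          # confirm the change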

Keep in mind that newly created projects inherit their settings from the template DSParams file, which is located in /opt/IBM/InformationServer/Server/Template/DSParams for Linux/UNIX and C:\IBM\InformationServer\Server\Template\DSParams for Windows by default; modify that file as well to ensure that new projects use the pre-8.1 logging mechanism.

After switching to RTLogging, the existing log detail entries in the repository can still be viewed via the Web Console or Server Console, but they will not be accessible using the DataStage Director. These log entries should be purged when they are no longer required by scheduling a background purge; purging up to 10,000 entries at a time is suggested to minimize memory requirements and to avoid WebSphere Application Server out-of-memory issues that can occur when trying to purge all the log entries in one task.

The following commands schedule a task that deletes up to 10,000 log entries at a time and loops until all the DataStage entries are purged. This could take some time depending on how many log entries you have to purge. This method has the advantage that it can be run without stopping other activities.

Use the following command line (from the …InformationServer/ASBServer/bin directory) to create the scheduled auto purge task (replace <user> <password> with a suite admin user/password):

For Windows:

LoggingAdmin -user <user> -password <password> -create -schedule -name "DS job event purge task" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN 

For Linux/UNIX:

./LoggingAdmin.sh -user <user> -password <password> -create -schedule -name "DS job event purge task" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN 

After all DS log events have been purged, the scheduled task can be deleted with the following command line:

For Windows:

LoggingAdmin -user <user> -password <password> -delete -schedule -name "DS job event purge task" 

For Linux/UNIX:

./LoggingAdmin.sh -user <user> -password <password> -delete -schedule -name "DS job event purge task"

The XMETA Repository for Information Server 8.1 grows rapidly until it runs out of disk space


Question

Why does the xmeta repository for Information Server v8.1 grow in size rapidly (much faster than in v8.0) until it runs out of disk space? Also, how do we recover from this situation if no more space can be allocated?

Note: The following document discusses several topics related to the size management of the Xmeta repository. Before you perform any of the suggested procedures to free up space in the Xmeta repository you must obtain a backup of the database.




Impact of Xmeta Logging (ORLogging) on the growth of the Repository


Beginning with Information Server (IIS) 8.1, sites may experience rapid growth in the size of the xmeta repository because IIS 8.1 changed the default logging method to "xmeta logging", which results in all job log messages being written to the xmeta repository. Prior to IIS 8.1, only non-job messages were stored in xmeta. As a result, sites that do not closely monitor or regulate the volume of logged messages may allow the xmeta repository to grow too large on busy systems, resulting in errors. Additionally, xmeta logging has performance implications and some known problems that can occur on systems that do not yet have fix pack 1 applied.


The following technote discusses the xmeta logging change in more detail, including how to switch to the original logging method, and also how to schedule a purge of DataStage job log messages already logged to the xmeta repository:


DataStage logging changes introduced in Information Server v8.1


In addition to the actions listed in the above technote, further analysis of the xmeta repository may be needed to determine if the growth was primarily due to message logging or other activities, and to adjust the Information Server and database configuration to best meet your needs.




Managing job log message histories with disk logging
  • After RTLogging is enabled you won't be able to use the Information Server Console to see new job logs. It will still show old log messages saved before the configuration change, plus messages generated by other IIS components such as the Information Services Director (ISD), but new job logs won't be accessible through this tool. To view job logs after RTLogging is enabled you need to use the DataStage Director client. 
  • After you switch to RTLogging you can use the DataStage Administrator client to define purge actions for all jobs in a specific project. Automatic purge can be configured based on either message age or the number of job runs (i.e. save messages from only the last 3 runs of a given job). By setting auto-purge you can prevent job logs from growing excessively. Be aware that changes here only affect newly created jobs. 
  • The DataStage Director client can be used to purge the history of a job after RTLogging has been enabled. To purge logs already stored in the xmeta repository you should use either the console options listed below, or the commands listed in the previously referenced technote. The instructions for purging messages via the consoles are described in the following InfoCenter topics:
    Purging log entries via the Web Console for Information Server
    Purging log entries via the Console for Information Server
  • Note that purging messages from the xmeta repository will not directly reduce the size of the table or the tablespace used by the xmeta repository. The table size can be reduced after purging messages by performing a table reorganization as discussed at the end of this technote.


Identifying largest tables and purging LoggingEvent Table

Use the following queries to determine the largest tables in xmeta and the best course of action:
  • In DB2:

    SELECT substr(tabname, 1, 25) tabname, substr(tabschema, 1, 25) tabschema,
    ((fpages * pagesize) / (1024*1024)) Size_in_MB
    FROM syscat.tables t, syscat.tablespaces ts
    WHERE fpages > 0 AND t.tbspace = ts.tbspace
    ORDER BY card DESC
    FETCH FIRST 50 ROWS ONLY 
  • In Oracle:

    SELECT * FROM
    (SELECT segment_name, owner, bytes/1024/1024 Size_in_MB
    FROM DBA_SEGMENTS ORDER BY bytes/1024/1024 DESC )
    WHERE ROWNUM <= 50

The above queries show the 50 largest tables in the xmeta database. The column Size_in_MB gives an approximation of the space used by each table. If the largest table is LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F, you can purge messages from this table to remove unnecessary rows.

If that is the case, first identify which message category is creating most of the rows in the table by running the following query:

SELECT CATEGORYNAME_XMETA, count(*)
FROM LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F
GROUP BY CATEGORYNAME_XMETA
ORDER BY count(*) DESC
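
On a DB2 repository the same query can be run from the command line with the DB2 command line processor; the following is a rough sketch, assuming the default database name xmeta and the XMETA schema:

db2 CONNECT TO xmeta
db2 "SELECT CATEGORYNAME_XMETA, COUNT(*) AS MSGS FROM XMETA.LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F GROUP BY CATEGORYNAME_XMETA ORDER BY 2 DESC"
db2 CONNECT RESET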

Once you have identified the message category that has the most rows, you can delete messages belonging to that category. For example, if most of the messages have a category of IIS-DSTAGE-RUN, those are job log messages and can be removed by running the LoggingAdmin command to purge messages with that category, as discussed in the following technote:

DataStage logging changes introduced in Information Server v8.1

NOTE: We recommend using only the methods discussed in this technote to purge messages from the logging table. You should not need to clear the table directly with database commands. Also, do not remove records from any table other than the LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F table. Other tables in the xmeta repository store data that should not be deleted, and doing so could render the repository unusable.


Analyzing Xmeta tablespaces in DB2

To view the size and status of the xmeta tablespace, open a DB2 command prompt and enter the following commands:
CONNECT to xmeta
LIST TABLESPACES SHOW DETAIL

This will list all the tablespaces, including xmeta. For each, it shows the number of used and free pages and the page size. If there are few free pages, your options are to increase the disk space available to the xmeta tablespace, or to free pages within the tablespace by purging unnecessary data such as old job log messages.

Note that if you purge old messages and then reorganize the table where messages were removed, that may free up pages within the tablespace, but it will not reduce the tablespace size on disk. For example, if the xmeta tablespace was using 25GB of space and was 99% full, then after a purge/reorg the tablespace will still have a size of 25GB but may be only 50% full, meaning that additional usage of xmeta can occur without needing to increase the tablespace size.

Normally DB2 can increase tablespace size as needed until the disk/volume runs out of space. Thus if you only had a small amount of space, such as 4GB, allocated for DB2, you may need to increase disk space allocation in addition to managing the deletion of aged messages.


Reorganizing DB2 tables after purging messages to free pages in tablespace
When messages have been purged from the LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F table, the space used by those messages is not immediately returned to the tablespace as free pages. That occurs when the table is reorganized. Some database configurations may periodically reorganize tables automatically. To force a DB2 table reorganization immediately, you can issue the following DB2 command:

db2 REORG TABLE XMETA.LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F use XMETATEMP;

Once the above command completes, you can run the LIST TABLESPACES SHOW DETAIL command again to confirm how many pages were freed in the tablespace and determine if additional action is still needed.
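
Putting the steps together, the whole check-and-reorganize sequence can be run from an operating system prompt with the DB2 command line processor; this is a sketch only, assuming the default xmeta database name and the XMETATEMP temporary tablespace used above:

db2 CONNECT TO xmeta
db2 LIST TABLESPACES SHOW DETAIL       # note used/free pages before the reorg
db2 "REORG TABLE XMETA.LOGGING_XMETAGEN_LOGGINGEVENT1466CB5F USE XMETATEMP"
db2 LIST TABLESPACES SHOW DETAIL       # confirm how many pages were freed
db2 CONNECT RESET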

Please note that this action frees pages WITHIN the tablespace; it does not actually reduce the amount of disk space that the tablespace is using. Doing that would require dropping and recreating the tablespace or moving the database to a new volume, which are actions beyond the scope of this technote.

If the large xmeta tablespace caused a disk full condition, then you will need to add space to that volume to prevent problems for any applications that run on that volume (including DB2, since it also needs space outside the xmeta tablespace, for example for other databases and DB2 log files).



Source : IBM FAQ

How to improve Job Performance?

There could be various reasons behind this issue. I will list a few of them here. In my case the IBM Information Server setup is on Windows Server 2003 SP2 with IBM Information Server 8.1. It is an SMP setup, and the metadata repository database (SQL Server 2005) is on the same server.

A server hang issue can occur when:

1) The metadata repository database detects a deadlock condition and chooses the failing job as the deadlock victim.
2) Log maintenance is ignored.
3) Temp folders are not maintained periodically.

I will try to explain the above three points in detail below:

1) Occurrence of deadlocks in the metadata repository database - I have seen this scenario in DataStage 8.1 with no fix packs installed. If you have the fix packs (released later) installed, you may not hit this problem at all. IBM Information Server throws an exception like "[IBM][SQLServer JDBC Driver][SQLServer]Transaction (Process ID 59) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction." This exception makes the whole process unresponsive, which in turn hangs the server and requires a physical reboot. Deadlocks cannot be eliminated entirely; we can only reduce how often they occur. The following two actions reduce the possibility of a deadlock:

a) Keep user interaction to a minimum during transactions. This mostly applies to SQL queries where user input is required.
b) Keep fewer records in one batch. For example, if you are using the Oracle Enterprise stage to load data, defining a smaller array size reduces the possibility of a deadlock during the job run.



2) Log maintenance is ignored - In DataStage 8.1, logs are stored in the metadata repository database. A table called "logging_xmetagen_LoggingEvent1466cb5f" holds all the logs in the XMETA database. If we ignore purging the logs periodically, this table grows very large and DML statements take longer to execute. Hence we should schedule log purging/clearing periodically. Log purging/clearing can be scheduled from the Director client.

To schedule the auto log purge, select a job, then choose Job menu --> Clear Log and set the options in the pop-up window. You can set the logs to purge immediately after a job run, or set auto-purge to keep only the last few runs or entries up to a particular date. It is usually best to set the auto purge with the "Older than (days)" setting.

One more thing to note here is that the table "logging_xmetagen_LoggingEvent1466cb5f" keeps logs for all the projects present on the server. Hence we should not simply fire DML against this table, as this could impact other projects as well. To remove entries from this table we can use the following command, which creates a schedule on the server using the LoggingAdmin utility and deletes the logs.

This command deletes the logs from the table "logging_xmetagen_LoggingEvent1466cb5f" irrespective of project, so be careful while executing it.

LoggingAdmin -user [DataStage Admin User] -password [DataStage Admin Password] -create -schedule -name "DS job log purge" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN

The above schedule can be deleted with the following command:

LoggingAdmin -user [DataStage Admin User] -password [DataStage Admin Password] -delete -schedule -name "DS job log purge"

To check the number of records in the "logging_xmetagen_LoggingEvent1466cb5f" table, the following query can be run against the XMETA database:

select count(*) from xmeta.logging_xmetagen_LoggingEvent1466cb5f
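
In this setup the repository is on SQL Server 2005, so the same count can be taken from a command prompt with sqlcmd; this is a sketch, assuming Windows authentication and that the repository database is named xmeta (replace MYSERVER\MYINSTANCE with your own server and instance):

sqlcmd -S MYSERVER\MYINSTANCE -E -d xmeta -Q "SELECT COUNT(*) FROM xmeta.logging_xmetagen_LoggingEvent1466cb5f"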

Having a large number of log entries in the repository database slows the log purge activity, which in turn slows job performance.



In DataStage 8.1, job log storage changed from the file system to the database. If your job logs have grown very large and you need a quick performance improvement, you can switch back to file-system logging by changing the following two project-level parameters in the DSParams file on Windows:

RTLogging=1
ORLogging=0

After changing the above settings, new logs are stored in the file system, and old logs can still be viewed in the Director client. Once all the logs are cleared we should revert the changes to the original setting. The DSParams file can be found at \IBM\InformationServer\Server\Projects\YOUR_PROJECT\

3) Folders maintenance - A few folders need regular maintenance in a DataStage Windows environment. The first is the tmp folder, which stores virtual datasets and other scratch data while jobs execute. Another is UVtemp; make sure this folder does not fill up, because a full folder causes jobs to hang or slow down, and clear it periodically. Last but not least is the &PH& folder; keep it clean as well. To remove the entries from the &PH& folder, the "CLEAR.FILE &PH&" command can be executed from the Administrator client: go to the Administrator client --> click the Projects tab --> select the appropriate project --> click Command --> paste CLEAR.FILE &PH& into the command text box and click the Execute button. Once cleared, you will see a popup window with the message "File &PH& has been cleared".
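
On UNIX/Linux engine installations the same housekeeping for the &PH& folder is often scripted from the shell instead; the sketch below is only an illustration (the project path and the 7-day age threshold are assumptions), and it should be run when no jobs are active:

PROJ=/opt/IBM/InformationServer/Server/Projects/myproject   # adjust to your project
find "$PROJ/&PH&" -type f -mtime +7 -exec rm -f {} \;       # remove phantom files older than 7 days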

If we follow the maintenance activities listed above, we can improve job performance and keep the server from going into a hung state. There are a few more tunings that we can perform on the server itself to make it suitable for large parallel jobs in a Windows environment.



Source : datastagetips

Thursday, August 4, 2011

ETL vs ELT

I have a good old article from Vincent McBurney comparing ETL and ELT. I really like the way he writes articles; keep writing, Vincent :D.

Here's the article:
Every now and then I come across a blog entry that reminds me there are people out there who know a lot more about my niche than I do! This is fortunate as this week it has helped me understand ELT tools.
ETL versus ELT and ETLT

The world of data integration has its own Coke versus Pepsi challenge - it's called ETL versus ELT. Not as exciting as Aliens versus Predator, not as compelling as Ali versus Frazier and not as sexy as Erin Brockovich versus whatever company that was ... but an important battle if you are in charge of a multi million dollar IT budget.

ETL (Extract, Transform and Load) is the coca cola in the challenge with Informatica and DataStage the champions in terms of license fees and market share. It is made up of software that transforms and migrates data on most platforms with or without source and target databases. Business Objects, SAS, Microsoft SSIS, Ab Initio and Cognos are also in the ETL corner.

ELT (Extract, Load and Transform) is the challenger and is now largely driven by RDBMS vendor Oracle with Oracle Warehouse Builder and Sunopsis. It consists of software that transforms and migrates data in a database engine, often by generating SQL statements and procedures and moving data between tables.

ELT technology was constrained by what the database was capable of, and since many ELT tools came from RDBMS vendors they tended to be suitable for just one database platform, e.g. Oracle Warehouse Builder and Microsoft DTS. They were also lacking functionality, as the vendor was more concerned with building a database than an ELT tool. Sunopsis was an exception as an ELT tool not owned by an RDBMS vendor (until Oracle acquired them).

Informatica has recently moved into the ETLT (Extract, Transform, Load and Transform) area with database pushdown optimization. This is standard ETL delivering to a target database and some extra sexy moves done moving it into more tables. Microsoft SSIS also has good ETLT capabilities within the SQL Server database.

Pros of each
I haven't had a lot of experience with ELT products but fortunately Dan Linstedt from the B-Eye-Network blogs has been talking about this topic for years now, and his recent entry ETL, ELT - Challenges and Metadata has a great comparison. Here are his pros of each listed below; visit his blog for further discussion and the cons of each tool:
Pros:
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do its work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware
I would add to this data quality. The ETL tools have a head start over ELT in terms of data quality integration with Informatica and DataStage integrating closely. The row-by-row processing method of ETL works well with third party products such as data quality or business rule engines.
And the Pros of ELT.
Pros:

* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ELT can achieve 3x to 4x the throughput rates on an appropriately tuned MPP RDBMS platform.
I'm not sure whether that final point refers to data that is being processed from one RDBMS table to another on the same server or whether it also applies to cross database migrations. If you are moving between database platforms ELT may need to transfer the data and save it before it commences transformation whereas ETL can transfer and transform in the one stream.
Another pro of ELT is that once the data is on the target platform it no longer places stress on the network; all further transformation is done on the RDBMS server. These pros of ELT are why ETL tools have an ELT capability, either the simple interface of DataStage in running user-defined SQL and procedures or the advanced capabilities of Informatica to generate SQL transformation code.

Informatica Pushdown ELT
In 2006 Informatica made an each-way bet in the challenge by putting ELT-style pushdown optimization into the PowerCenter ETL tool. You can see a very good list of 8.1.1 and pushdown optimization best practices at the ETL/ELT forum.

Sunopsis ELT
You can see Sunopsis ELT in action at Mark Rittman's Sunopsis Data Conductor: Creating an Oracle Project, and in Moving Global Electronics Data using Sunopsis he evaluates the Oracle SQL generated.

DataStage ELT
In DataStage, ELT is available by running a SQL statement from a database stage or by calling a stored procedure. It can generate SQL statements but not SQL procedural code, so it is not as flexible or advanced as PowerCenter or Sunopsis. IBM may add this functionality within the next 18 months. They already have a DB2 ELT tool in the DB2 Data Warehouse Edition in the form of a SQL language generator for populating a data warehouse.

The Wrap
I am not convinced advanced ELT is good in an ETL tool as it increases upgraditis.
Upgraditis afflicts ETL tools when they struggle to keep up with all the upgrades in the products they integrate with. Good ETL tools have native access to most common database engines and ERP systems. They constantly need to upgrade and certify these items against new versions of these products. When Informatica added pushdown optimization they bought into a whole new area of upgraditis and will have to keep the pushdown compatible with new versions of all the databases it supports.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Monday, August 1, 2011

ETL Engines: VLDW & Loading / Transforming

You can read this good article from Dan E. Linstedt:
I hope you've enjoyed this series; I've not received any comments either way. I'll be finishing up this series soon. In this entry I'll address "ETL" processing in general, in another entry I'll discuss "ELT" in general, and then I'll begin to discuss BI Queries and Query engines going forward; finally, at the end of this series I'll bring in a couple of "appliances" or "appliance-like" performance enhancers.


In this entry, I'm going to focus on super fast, high speed ETL. This entry is less about the tools, and more about the architectures that work. I hope you enjoy it.


ETL is frequently used to move huge batches of data around; in fact, it's now an "aging" technology, but has proven itself time and time again as a necessity for the following reasons:
a) Managing metadata: technical metadata, process flows, process scheduling, structured data sets, and more recently: "semi-structured" data sets like XML and so on.
b) GUI Driven partitioning and parallelism
c) GUI Driven grid setup / administration and maintenance
d) Data Quality.
e) Access to unstructured data, and conversion to a structured world.
f) GUI driven connections to heterogeneous sources and targets



When we look at this list, there's not only a lot of power available, there are tremendous benefits to using the ETL tools on the market (I'm sure there are more, but some of the ones I'm familiar with are listed below, hopefully in alphabetical order):
* Ab-Initio
* Business Objects - Data Integrator
* Hummingbird - Genio
* IBM - DataStage
* Informatica - PowerCenter
* Microsoft - SSIS
* Oracle - Sunopsis (whom we haven't heard from since the acquisition)
* Pentaho - Kettle
* Talend
* Teradata - Warehousing Utilities



Within these tool sets, there are a variety of options to choose from, but what is most important in VLDW or large data sets is the architecture. The _manner_ in which we construct our loads greatly impacts performance.


Too often, the industry or the vendor suggests that simply by throwing hardware at the problem, performance can be solved (oh, if only you would buy grid, or buy this, or buy that...). That helps for a little bit, but ultimately doesn't solve the problem.


If your ETL routines (during peak operation) are not pushing the existing hardware beyond an average load of 60%, or are not sustaining a hardware load of 80%, then most likely the architectures of individual "data flows" are not correct, especially for big data.


Now wait a minute! You're telling me that just because I can, doesn't mean I should?
Yep... Too many tools make it too easy to "add another source, add another target, mix inserts, with updates, with deletes" all in the same data flow. This adds unwanted and unneeded complexity. Any time complexity enters, performance leaves.



I've blogged about this before... If you're headed down a highway towards a brick wall, and you refuse to change direction, what will happen? You'll hit the brick wall.


If you've got performance problems, and you refuse to change the architecture to try new things, you'll still have performance problems. There's only so much tweaking of knobs that can help performance, then it's all up to the architecture.


With Big Data & ETL, the first thing to realize is the mathematics behind the arrival of the data sets. The mistakes that ETL designers make are as follows: (these get more painful, the larger the data set, and / or the lower the latency of arrival of the data)
1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
3) Targeting more than 1 or 2 target tables
4) moving rows that are too wide through a single process
5) loading very large data sets to targets WITH INDEXES ON
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false, multi-passing the data can actually improve performance by orders of magnitude. IF parallelism can be increased.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).



Oh, and did I mention running on a 32 bit environment which causes significant OS Swapping to occur?
Let me explain this one. Any single process started within a 32 bit environment is limited (immediately) to 2GB of addressable RAM, now - take away the RAM needed for the threads or the executable engine code... usually around 300M or more, leaving us with 1.5GB to 1.7GB of RAM to work with. Then, take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have....



Now, you are left with possibly 200MB, maybe 100MB of RAM left to allocate for "data reading/data writing"... and it drops from there.
Let me explain this other piece as well: Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY!!! So there's additional Limitations there, Windows 32 Bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, AND BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 Bit - RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put 1/2 in the pagefile.sys swap area.
By the way, here's a tid-bit for you that you have to try to believe:
If you are running Windows 32 bit, SQLServer (either version)....
1. Change your ETL "block size" to 8k to match the Database (4k I think in SQLServer2000)
2. Change your row size to fit as many rows as possible into an 8k block, the more rows per block the faster the performance.
3. Change your commit point to 8,500 rows (this is the sweet spot)
4. Check in on NETWORK PACKET SIZES and increasing those between the ETL engine and the Database, increase them from 2k/4k to 8k - again to match the disk.
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except in SQLServer2005 direct CT-LIB). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow slow - DB-LIB is 12 year old technology, and is NOT multi-threaded, nor parallel, where CT-LIB is.



Windows 64 Bit does NOT have these limitations, but requires 30% more hardware to run "cool" and efficiently; it is FAST when installed and configured correctly on the right hardware.
Linux, and Unix do not have these limitations either.


So, architecture is everything. If you take the list I've published of what NOT to do, and change your architecture accordingly, you should see significant performance gains when running "the tuned data flow" by itself (as a test).


There's one more piece I'd like to share (for instance, why letting the DB handle errors slows down the processing).


Most ETL engines handle "blocks" of rows, which is why if you think transactionally, or architect transactionally (row by row), then performance will NEVER reach its maximum. Let's say the ETL engine puts 150 rows in a block, and sends this block to the database. The database then caches this block until it reaches a commit point (in general).


The database is then "asked" to commit the rows it has cached (in TEMP, mind you). The database then has to iterate all the rows in each block (using an optimized bind array) and bounce the "rows in error" ONE AT A TIME back to the ETL engine. Now I ask you: is this transactional or "batch" oriented thinking? Right! Transactional... Of course it's going to be slow. Now, the other piece that happens is that the ETL engine has to STOP PROCESSING NEW ROWS, pick up the "rows in error" one at a time, and flow them out to disk.
We've just quadrupled the I/O for that row and decreased our performance by a factor of 4x (at least). Why? The row traveled from the ETL buffer to the network, over the network to the database, was cached in TEMP in the database, then was read from TEMP (possibly disk), checked against the real table (again disk), and bounced back over the network. The ETL engine at this point STOPS its parallel processing of all kinds, just to handle the row in error, one at a time, and write it to local disk.


This same process is repeated whether we load directly from ETL or we load with a database loader. The process is worse when we mix inserts with updates with deletes. Why? Because the database can no longer use block-style bind arrays; it now must process each row in its own individual statement (transactionally based).


So already you've lost 4x to 10x the performance you COULD be achieving, which means that a "data flow" that COULD execute at 80,000 rows per second now executes at 8,000 rows per second. A significant price to pay when dealing with large data sets.


The math:
800 million rows / 80,000 rows per second = 2.77 hours (about)
800 million rows / 8,000 rows per second = 27.77 hours (about)



A huge performance drain. Again, performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines.
Thoughts? Comments? I'd love to hear from you.

Cheers,
Dan L


How to find and replace string in multiple files in UNIX environment

When you work with DataStage in a UNIX environment, you will often need to find and replace a string in multiple files. A simple way to do this is to combine sed with find.

Syntax:
$ find <directory> -type f -exec sed -i 's/<old_string>/<new_string>/g' {} \;
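
For example, to replace an old database server name with a new one in every shell script under a directory (the names and paths here are made-up values for illustration), you can preview the affected files first and then apply the change:

$ grep -rl 'OLD_DB_SERVER' /home/dsadm/scripts
$ find /home/dsadm/scripts -type f -name '*.sh' -exec sed -i 's/OLD_DB_SERVER/NEW_DB_SERVER/g' {} \;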

Hope it helps