Sunday, July 31, 2011

17 Mistakes that ETL Designers make with Very Large Data

Again, Vincent McBurney has written another good article :D

17 Mistakes that ETL Designers make with Very Large Data
These are the mistakes that ETL designers can make when processing scary high data volumes.
Dan Lindstedt is a very large data guru and he has a series of outstanding blog posts on very large databases, the latest being ETL Engines: VLDW & Loading / Transforming.  I first came across Dan on ETL/ELT forums, where he has over 20,000 forum posts.  He popped up on B-Eye-Network blogs.  Dan has had no comments on his latest post series yet, as B-Eye doesn't have a reader-friendly design and that discourages reader participation.  For example, I just got three comments over a weekend on three old archived ETL posts.  ITToolbox is a friendly place for reader participation.
My favourite part of Dan's latest post is the 17 mistakes that ETL Designers make with very large data.  In fact that's the title I would have gone for!  For some reason blog titles with a number in them attract more hits.  I've shown Dan's list of 17 below with my own comments on how that impacts DataStage developers.  I would love to hear your own contributions for common ETL design mistakes.
1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
Agree 100%.  I believe in having at least three bands in your processing: extract from source to file, process the file into a load-ready dataset, and load the load-ready dataset.  For a standard target table load I would have an insert job, an update job, a delete job (if needed) and a bulk load job (for large volumes). 
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
I can see this working well for smaller volumes - and the combination of DataStage accessing data from multiple systems via a Federation Server plugin is intriguing, but this type of cross database joining would be nasty on very large volumes.  Pulling the smaller data volumes into target tables or lookup datasets would be faster.
3) Targeting more than 1 or 2 target tables
Agree 100%.  You could take it to two tables if there is a very close parent/child relationship like a master table and an attribute table, or a header table and a detail table.  But that's the exception, most large volume ETL jobs should be preparing data for one target table.
4) moving rows that are too wide through a single process
I don't know how you get around this one.  If you have a lot of columns then you gotta get it in there!
5) loading very large data sets to targets WITH INDEXES ON
DataStage makes this easy to manage: in a standard database stage (insert or bulk load actions) you can use the before-SQL tab to turn indexes off and the after-SQL tab to turn them back on.  The statements on those tabs are run just once for each job (and not per row).  You don't need indexes and keys if you have other ways to check your referential integrity.
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
Examples of why you need to be in the good books with your DBAs! 
"My DataStage job is running slow, I think there is something wrong with the database table."
"You sure it's not DataStage?"
"Can you check the table, maybe run a trace?"
"What is DataStage?"
"Could you just check the table, I've got the job running right now!"
"I ran a query on the table and it's running fine."
"Could you maybe get off Facebook and do a friggen trace!"
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
DataStage Parallel Jobs need a lot of RAM for the lookup stage.  The IBM Information Server Blade starts with 4G of RAM per blade for two dual core CPUs on each blade.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
I think the treblification of this one is an OS swapping joke.  I am going to be talking about 64 bit processing with the IBM architects when I meet them at IoD 2007 next month.  The Information Server can run on some 64 bit environments but it will be interesting to find out what plans IBM have for expanding this.
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
This is a tricky one - kind of like rows that are too wide - sometimes you need to do it and you hope the massively parallel architecture is up to it.  DataStage can make almost every type of stage work in parallel so it can get away with extra steps, however this does make the job harder to debug and sometimes you get those random out of resource errors...
14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false, multi-passing the data can actually improve performance by orders of magnitude. IF parallelism can be increased.
I'm not sure what Dan means by multi-passing and I'll ask in his comments thread.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
More detail about this is in Dan's post.  If you try to trap database rejects in a Server Job you use a reject link from a Transformer prior to the database stage.  Dan points out that each reject row slows down the job by a factor of 4 as the ETL job stops processing to handle that reject row. 
Parallel Jobs are more efficient as they use a reject link out of the database stage and on a parallel architecture can push the handling of those rows into a new process.  I haven't seen any overhead in this design if you don't get rejects, and database rejects should be under 0.01% of your rows or else your design is faulty. 
I've never tried to trap bulk load, MultiLoad, TPump or any other type of native database load errors back in DataStage, and I don't think the stages can accept them anyway, so I let the database handle them.  On smaller volume jobs I often use the database reject link and insert instead of bulk load for more robust auditing of rows. 
16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
A bit hard to avoid in a row-by-row ETL tool!  The parallel architecture and caching and memory sharing and a bunch of other things make it fast.
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
Fortunately this is very hard to do in a DataStage job or people would misuse it!  In DataStage looping on a per row basis can be done via lookups to database stored procedures or custom code modules but most standard DataStage stages do an action per row.

Think outside the box.
A lot of ETL sites have templates for building new jobs or design standards or example jobs.  These are excellent for 90% of your ETL work however very large data jobs may need custom designs.  You should look at what has been done in other jobs but also be ready to take a job into performance testing to try out dozens of other configurations.  As Dan says in his post - "performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines".
A great series of posts from Dan with some cool Oracle and SQL Server tips in earlier posts.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Friday, July 29, 2011

Using DataStage 8 Parameter Sets to Tame Environment Variables

Another good article from Vincent McBurney:


Using DataStage 8 Parameter Sets to Tame Environment Variables
Special Team Parameter Sets can remove some of the mystery from DataStage Parallel Job Environment Variables.
In a previous post I looked at How to Create, Use and Maintain DataStage 8 Parameter Sets.  In this second of three posts on Parameter Sets I look at combining Environment Variables with Parameter Sets, and in the final post I look at User Defined Environment Variables and Parameter Sets.
Parameter Sets have the potential to make Environment Variables much easier to add to jobs and easier to use across a large number of jobs.

Environment Variables and Parameter Sets
Environment Variables are set every time you log into a computer.  They are set for all Unix, Linux and Windows logins and you can see them if you type "env" (or "set" on Windows) at a prompt.  DataStage Parallel Jobs have a special set of Environment Variables that get added during a DataStage installation, and they are exposed through the DataStage Administrator so you can edit them more easily.  You can see most of them documented in Chapter 6 of the Parallel Job Advanced Developer's Guide.
There used to be just two ways to set an environment variable for a job. 
1)  In the Administrator tool you set it centrally and it impacts every job that runs; this variable gets set in the session that starts any DataStage job.
2)  In the DataStage Designer you add the same environment variable to a job via the Add Job Parameter screen and use it as an override just for that job.
Let's look at some of the problems with these environment variables prior to version 8:
  1. Neither of those screens is a very friendly user interface.  The parameters are in a long list, they have long and technical names, and it's hard to work out how different parameters relate to each other.
  2. You can bring in a DataStage guru who can spend weeks fine tuning your Environment Variable values for you in a performance testing environment - however it only takes one dunce to come along, use Administrator to change a setting, and lose all that value.
  3. It's very time consuming to add these environment variables to jobs.
  4. If you use Sequence Jobs you will find yourself having to pass values through from the Sequence job level to the parallel job level in the Job Activity stage properties for every single parameter, leading to lots of time spent configuring Sequence jobs (there is a sketch of what that per-parameter hand-off looks like under the covers just after this list).
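
A Sequence job compiles down to BASIC job control code that pushes each of those values into the parallel job before running it, so the hand-off ends up looking roughly like the sketch below.  The job name "HighVolumeLoad" and the config file path are invented for the example; environment variables that have been added to a job as parameters are set with their $ prefix:

hJob = DSAttachJob("HighVolumeLoad", DSJ.ERRFATAL)
* Override an environment variable that was added to the job as a parameter
ErrCode = DSSetParam(hJob, "$APT_CONFIG_FILE", "/opt/IBM/config/8node.apt")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
ErrCode = DSDetachJob(hJob)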

Parameter Sets change all this.  Imagine if you could add Environment Variables to a job by choosing from a shorter list, with a group of environment variables and a name that indicates what that group of variables is trying to achieve.
By creating some "special team" Parameter Sets and adding environment variables to them we simplify the creation and management of these values.  A DataStage parallel guru sets them up at the beginning of a project, they are performance tested to verify they work and then all developers who follow can benefit from using those Parameter Sets.  You need to recompiled the job if you add or remove a Parameter Set or a parameter from a Parameter Set but apart from that no changes to the job are necessary.
I have created some example Parameter Sets full of Environment Variables to illustrate how this works.  The first two scenarios show how to create a Parameter Set for very high and very low volume jobs.  This lets you setup your project wide variables to suit medium jobs or "all comers" and lets you override specific settings for the extremes of data volumes.

High Volume Job
The idea here is you choose a typical high volume job and test the hell out of it using all the DataStage reporting and performance monitoring software and then via trial and error you tune some environment variables in a Parameter Set to deliver faster performance.  You then apply that Parameter Set to all similar high volume jobs.
Testing will show whether you can use one Parameter Set for all high volume jobs or whether you need different Parameter Sets for different types of jobs - such as those that write to a file versus those that write to a database.
For high volume jobs the first environment variables to look at are:
  • $APT_CONFIG_FILE: lets you define the biggest config file with the largest number of nodes.
  • $APT_SCORE_DUMP: when switched on it creates a job run report that shows the partitioning used, degree of parallelism, data buffering and inserted operators.  Useful for finding out what your high volume job is doing.
  • $APT_PM_PLAYER_TIMING: this reporting option lets you see what each operator in a job is doing, especially how much data they are handling and how much CPU they are consuming.  Good for spotting bottlenecks.
One way to speed up very high volume jobs is to pre-sort the data and make sure it is not resorted in the DataStage job.  This is done by turning off auto sorting in high volume jobs:
  • APT_NO_SORT_INSERTION: stops the job from automatically adding a sort command to the start of a job that has stages that need sorted data, such as Remove Duplicates.  You can also add a Sort stage to the job and set it to "Previously Sorted" to avoid this in a specific job path.
Buffering is another thing that can be tweaked, it controls how data is passed between stages, usually you just leave it alone but on a very high volume job you might want custom settings:
  • APT_BUFFER_MAXIMUM_MEMORY: Sets the default value of Maximum memory buffer size.
  • APT_BUFFER_DISK_WRITE_INCREMENT: For systems where small to medium bursts of I/O are not desirable, the default 1 MB write-to-disk chunk size may be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this and can be set larger than 1048576 (1 MB). The setting may not exceed max_memory * 2/3.
  • APT_IO_MAXIMUM_OUTSTANDING: Sets the amount of memory, in bytes, allocated to a WebSphere DataStage job on every physical node for network communications. The default value is 2097152 (2MB). When you are executing many partitions on a single physical node, this number may need to be increased.
  • APT_FILE_EXPORT_BUFFER_SIZE: if your high volume jobs are writing to sequential files you may be overheating your file system, increasing the size of this value can deliver data to files in bigger chunks to combat long latency.
These are just some of the I/O and buffering settings.


Low Volume Job
By default a low volume job will tend to run slowly on a massively scalable DataStage server. 
There are far fewer environment variables to set, as low volume jobs don't need much special configuration.  Just make sure the job is not trying to partition data, as that could be overkill when you don't have a lot of data to process.  Partitioning and repartitioning data on volumes of less than 1000 rows makes the job start and stop more slowly:
  • APT_EXECUTION_MODE: By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode: ONE_PROCESS, MANY_PROCESS or NO_SERIALIZE.
  • $APT_CONFIG_FILE: lets you define a config file that will run these little jobs on just one node so they don't try any partitioning and repartitioning.
  • $APT_IO_MAXIMUM_OUTSTANDING: when a job starts on a node it is allocated some memory for network communications - especially the partitioning and repartitioning between nodes.  This is set to 2MB by default, but when you have a squadron of very small jobs that don't partition you can reduce this size to make the job start faster and free up RAM.
Other Parameter Sets
You can set up all your default project Environment Variables to handle all data volumes in between.  You can still have a Parameter Set for medium volume jobs if you have specific config files you want to use. 
You might also create a ParameterSet called PX_MANY_STAGES which is for any job that has dozens of stages in it regardless of data volumes.
  • APT_THIN_SCORE: Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.
This can be combined with the large volume Parameter Set in a job so you have extra configuration for high volume jobs with many stages.
You might also create a ParameterSet for a difficult type of source data file when default values don't work, eg. PX_MFRAME_DATA:
  • APT_EBCDIC_VERSION: Certain operators, including the import and export operators, support the "ebcdic" property specifying that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION variable indicates the specific EBCDIC character set to use.
  • APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL: When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.
SAS is another operator that has a lot of configurable environment variables because when you are reading or writing native SAS datasets or running a SAS transformation you are handing some of the control over to SAS - these environment variables configure this interaction:
  • APT_HASH_TO_SASHASH: can output data hashed using sashash - the hash algorithm used by SAS.
  • APT_SAS_ACCEPT_ERROR: When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for WebSphere DataStage to terminate the operator with an error.
  • APT_NO_SAS_TRANSFORMS: WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.
You can group all known debug parameters into a single debug Parameter Set to make it easier for support to find:
  • APT_SAS_DEBUG: Set this to set debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.  Don't put this into your SAS Parameter Set as the support team might not be able to find it or know it exists.
  • APT_SAS_DEBUG_IO: Set this to set input/output debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.
  • APT_SAS_SCHEMASOURCE_DUMP: When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.
So a new developer who is handed a high volume job does not need to know anything about environment variables, they just need to add the right ParameterSet to the job.  And if an experienced developer decides a new environment variable needs to be added to high volume jobs they just add it to the central ParameterSet and recompile all the jobs that use it.  The "Where Used" function will help identify those jobs.
ParameterSets and environment variables make a powerful combination.  ParameterSets can act as a layer that simplifies environment parameters and makes them easier to add to jobs.

Thursday, July 28, 2011

DS Command

These are some DS commands, but use them wisely, especially if you are an inexperienced user, because they can sometimes make your DataStage installation unstable.

DS.TOOLS Menu 
Option Text on Screen 
1 Report on project licenses 
2 Rebuild Repository indices 
3 Set up server-side tracing >> 
4 Check integrity of job files 
5 Administer processes/locks >> 
6 Adjust job tunable properties 

DS.TRACE Menu 
Option Text on Screen 
1 Query trace status 
2 Turn tracing on 
3 Turn tracing off 
4 List trace records 
5 Delete all trace records 

DS.PLADMIN Menu 
Option Text on Screen 
1 List all processes 
2 List state of a process 
3 List state of all processes in a job 
4 List all locks 
5 List locks held by a process 
6 List locks held by all processes in a job 
7 Clear locks held by a process 
8 Clear locks held by all processes in a job 
9 Logout a process 
10 Logout all processes in a job 
11 Clear status file for a job 


DS.TOOLS Menu 
Option Command 
1 DSR_LICENSE REPORT 
2 DS.REINDEX ALL 
3 DS.TRACE 
4 DS.CHECKER JOBS 
5 DS.PLADMIN 
6 DS.TUNEPROPS 

DS.TRACE Menu 
Option Command 
1 TRACE 
2 TRACE ON 
3 TRACE OFF 
4 TRACE LIST 
5 TRACE CLEAR 

DS.PLADMIN Menu 
Option Command 
1 DS.PLADMIN.CMD LIST PIDS * 
2 DS.PLADMIN.CMD LIST PIDS 
3 DS.PLADMIN.CMD LIST PIDS 
4 DS.PLADMIN.CMD LIST LOCKS * 
5 DS.PLADMIN.CMD LIST LOCKS 
6 DS.PLADMIN.CMD LIST LOCKS 
7 DS.PLADMIN.CMD CLEAR LOCKS 
8 DS.PLADMIN.CMD CLEAR LOCKS 
9 DS.PLADMIN.CMD CLEAR PIDS 
10 DS.PLADMIN.CMD CLEAR PIDS 
11 DS.PLADMIN.CMD CLEAR STATUS 


Error : DSR_ExecJob (Action=5) when running a job

I got the resolution for this issue from IBM Support:

Problem(Abstract)

In DataStage Director, the following errors occur when trying to view folders in a project:

Error calling subroutine: DSR_EXECJOB (Action=5); check DataStage is set up correctly in project <project name> (The connection is broken (81002))


Error calling subroutine: *DataStage*DSR_SELECT (Action=3); check DataStage is set up correctly in project <project name> (The connection is broken (81002))

Cause

The errors were received because the Logging Agent was not running.

Diagnosing the problem


  1. Check permissions on the project. Verify the user has RWX permissions on the files in the project.
  2. Verify the Agents (Logging Agent and ASB Agent) are running.

Resolving the problem

To start the Agents, restart the Node Agents on the engine tier (for example with the NodeAgents.sh script under ASBNode/bin on Unix, or by starting the ASB Agent and Logging Agent services on Windows) and then retry the operation.

Wednesday, July 27, 2011

How to solve Error:No conductor nodes found in export node pool

If you find this error when you are writing to a Sequential File stage, it usually happens when the conductor node is in a different pool from the other (processing) nodes, or when APT_PM_CONDUCTOR_NODENAME has been set manually.

To solve this problem, you may try adding the machine name before the file path. For example:
<Machine_Name>:<Filename Path>
e.g. ETL_Machine_1:/my_etl_source/output/my_dest_filename.txt


Sunday, July 24, 2011

10 Ways to Make DataStage Run Slower


I've got a good article from one of my favourite DataStage gurus, Vincent McBurney. It's hilarious but it really gets the point across and should be remembered by all DataStage developers:

Vincent said:
"Everyone wants to tell you how to make your ETL jobs run faster, well here is how to make them slower!
The Structured Data blog has posted a list, Top Ways How Not To Scale Your Data Warehouse, that is a great chat about bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?

1. Use the same configuration file for all your jobs.

You may have two nodes configured for each CPU on your DataStage server and this allows your high volume jobs to run quickly but this works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike, they take a lot of time to crank up to full speed but once you are there they are lightning fast. If you are processing a handful of rows the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes and a squadron of these jobs will slow down your entire DataStage batch run!


2. Use a sparse database lookup on high volumes.

This is a great way to slow down any ETL tool, it works on server jobs or parallel jobs. The main difference is that server jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, and for every row processed it will send a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!


3. Keep resorting your data.

Sorting is the Achilles heel of just about any ETL tool, the average ETL job is like a busy restaurant, it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food. You need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement but sorting it again anyway!


4. Design single threaded bottlenecks

This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality. A script FTP of a file that keeps an entire DataStage Parallel engine waiting. A bulk database load via a single update stream. Reading a large sequential file from a parallel job without using multiple readers per node.


5. Turn on debugging and forget that it's on

In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!


6. Let the disks look after themselves

Never look at what is happening on your disk I/O - that's a Pandora's Box of better performance! You can get some beautiful drag and slow down by ignoring your disk I/O as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times.


7. Keep Writing that Data to Disk

Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, it can give you a modular job design. It can also slow down Parallel Jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a stupid ASCII metadata dumb file and then import and repartition to pick it up and process it again. Sequential files are the Forrest Gump of file storage, simple and practical but dumb as all hell. It costs time to write to them and time to read and parse them, so designing an end to end process that writes data to sequential files repeatedly will give you massive slow down times.


8. Validate every field

A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before being processed by DataStage. One way to slow down your job and slow down your ETL development and testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion - such as numbers in VARCHAR fields and dates in string fields and packed fields, but most don't. Adding a layer of validation you don't need should slow those jobs down.


9. Write extra steps in database code

The same phrase gets uttered on many an ETL project. "I can write that in SQL", or "I can write that in Java", or "I can do that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a sure fire way to end up with a step in the end to end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".


10. Don't do Performance Testing

Do not take your highest volume jobs into performance testing, just keep the default settings, default partitioning and your first draft design and throw that into production and get the hell out of there.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way."
Thanks a lot Vincent, Sandy

How to check Runtime Information for performance tuning


I've got this good article from the IBM Information Center:

Runtime information

When you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow. This information is written to the job log when the job is run.
An example output is:
##I TFPM 000324 08:59:32(004) <generator,0> Calling runLocally: step=1, node=rh73dev04, op=0, ptn=0
##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk  elapsed: 0.04  user: 0.00  sys: 0.00  suser: 0.09  ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(006) <peek,0> Calling runLocally: step=1, node=rh73dev04, op=1, ptn=0
##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk  elapsed: 0.01  user: 0.00  sys: 0.00  suser: 0.09  ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(013) <peek,1> Calling runLocally: step=1, node=rh73dev04a, op=1, ptn=1
##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk  elapsed: 0.00  user: 0.00  sys: 0.00  suser: 0.09  ssys: 0.02 (total CPU: 0.11)
This output shows that each partition of each operator has consumed about one tenth of a second of CPU time during its runtime portion. In a real world flow, we'd see many operators, and many partitions.
It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than others, it might mean the data is partitioned in an unbalanced way, and that repartitioning, or choosing different partitioning keys might be a useful strategy.
If one operator is using a much larger portion of the CPU than others, it might be an indication that you've discovered a problem in your flow. Common sense is generally required here; a sort is going to use dramatically more CPU time than a copy. This will, however, give you a sense of which operators are the CPU hogs, and when combined with other metrics presented in this document can be very enlightening.
Setting the environment variable APT_DISABLE_COMBINATION might be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so this should be done with care.
Unlike the job monitor cpu percentages, setting APT_PM_PLAYER_TIMING will provide timings on every operator within the flow.

How to debug your DataStage Parallel Jobs

Here is a good article with tips for debugging a DataStage job:



Steps

Enable the following environment variables in DataStage Administrator:
  • APT_PM_PLAYER_TIMING – shows how much CPU time each stage uses
  • APT_PM_SHOW_PIDS – show process ID of each stage
  • APT_RECORD_COUNTS – shows record counts in log
  • APT_CONFIG_FILE – switch configuration file (one node, multiple nodes)
  • OSH_DUMP – shows OSH code for your job. Shows if any unexpected settings were set by the GUI.
  • APT_DUMP_SCORE – shows all processes and inserted operators in your job
  • APT_DISABLE_COMBINATION – do not combine multiple stages into one process. Disabling operator combination makes it easier to see where your errors are occurring.
  • Use a Copy stage to dump out data to intermediate peek stages or sequential debug files. Copy stages get removed during compile time so they do not increase overhead.
  • Use row generator stage to generate sample data.
  • Look at the phantom files for additional error messages: c:\datastage\project_folder\&PH&
  • To catch partitioning problems, run your job with a single node configuration file and compare the output with your multi-node run. You can just look at the file size, or sort the data for a more detailed comparison (Unix sort + diff commands).
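
One more debugging aid that is not on the list above but goes hand in hand with it: a small before-job subroutine that dumps the job's parameter values into the log, so that when a run goes wrong you can see exactly which config file and settings it picked up. A minimal sketch (the routine name is made up, and it uses the documented DSGetJobInfo/DSGetParamInfo calls):

Subroutine DumpJobParams(InputArg, ErrorCode)
   ErrorCode = 0
   * DSJ.ME refers to the job this routine is running in
   ParamList = DSGetJobInfo(DSJ.ME, DSJ.PARAMLIST)
   NumParams = DCount(ParamList, ",")
   For i = 1 To NumParams
      ParamName = Field(ParamList, ",", i)
      ParamValue = DSGetParamInfo(DSJ.ME, ParamName, DSJ.PARAMVALUE)
      Call DSLogInfo(ParamName : " = " : ParamValue, "DumpJobParams")
   Next i
Return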

Saturday, July 23, 2011

How does Hash Partition works in Datastage | Datastage Tutorials-Datastage ETL Tool

You can read the full article at:
How does Hash Partition works in Datastage | Datastage Tutorials-Datastage ETL Tool

Happy learning,
Sandy

How to release a Read Only Job?

In the Administrator client, open the project's Command window and run:
UPDATE DS_JOBOBJECTS SET READONLY = 'NRO' WHERE READONLY = 'RO'

Rob Vonk's Function : GetXMLProperty

Source: Rob Vonk's GetXMLProperty Function

Name: GetXMLProperty
Author: Rob Vonk
Language: DataStage BASIC
Description:
Gets a property value from an XML fragment. I use it to get data from DataStage XML exports.
Snippet:
Function GetXMLProperty(XMLPart, PropertyName)

* Default to an empty string if the property is not found
Result = ""

* Build the opening tag we are looking for, e.g. <Property Name="Description">
Property = '<Property Name="' : PropertyName : '">'

* Case-insensitive search: the MCU conversion uppercases both strings before Index
P = Index(Iconv(XMLPart,"MCU"),Iconv(Property,"MCU"),1)

If P > 0 then
  * Take everything after the opening tag...
  TempXML = XMLPart[P + Len(Property),Len(XMLPart)]
  * ...and cut it off at the next "<", which starts the closing tag
  P = Index(TempXML,'<',1)
  If P > 0 then
     TempXML = TempXML[1,P-1]
     Result = TempXML
  end
end

* Ans is the value returned by a DataStage transform function
Ans = Result
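
For anyone who has not used it before, here is a minimal sketch of calling the function from another routine or from job control code. It assumes the function has been saved as a DataStage transform function called GetXMLProperty (hence the DSU. prefix in the Deffun); the XML fragment and property name are just made-up examples:

Deffun GetXMLProperty(XMLPart, PropertyName) Calling "DSU.GetXMLProperty"

* A made-up fragment like the ones found in a DataStage XML export
XMLText = '<Property Name="Description">Loads the daily sales feed</Property>'

Desc = GetXMLProperty(XMLText, "Description")
Call DSLogInfo("Description = " : Desc, "GetXMLPropertyDemo")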

DataStage Functions and Routines

Rob Vonk's article about DataStage Functions and Routines:




$Define


Defines identifiers that control program compilation or supplies replacement text for an identifier.



$IfDef


Tests an identifier to see if it is defined or not defined.



$IfNDef


Tests an identifier to see if it is defined or not defined.



$Include


Inserts source code contained in a separate file and compiles it along with the main program.



$Undefine


Removes an identifier that was set using the $Define statement. If no identifier is set, $Undefine has no effect.



Abs


Returns the absolute (unsigned) value of a number.



ACos


returns the arc-cosine of number in degrees. ACos is the inverse of Cos.



Alpha


Checks if a string is alphabetic. If NLS is enabled, the result of this function is dependent on the current locale setting of the Ctype convention.



Ascii


Converts the values of characters in a string from EBCDIC to ASCII.



ASin


returns the arc-sine of number in degrees. ASin is the inverse of Sin.



ATan


returns the arc-tangent of number in degrees. ATan is the inverse of Tan.



BitAnd


compares two integers bit by bit. For each bit, it returns bit 1 if both bits are 1; otherwise it returns bit 0.



BitNot


inverts the bits in an integer, that is, changes bit 1 to bit 0, and vice versa. If bit.number is specified, that bit is inverted; otherwise all bits are inverted.



BitOr


compares two integers bit by bit. For each bit, it returns bit 1, if either or both bits is 1; otherwise it returns bit 0.



BitReset


resets the specified bit to 0. If it is already 0, it is not changed.



BitSet


sets the specified bit to 1. If it is already 1, it is not changed.



BitTest


tests if the specified bit is set. It returns 1 if the bit is set; 0 if it is not.



BitXOr


compares two integers bit by bit. For each bit, it returns bit 1 if only one of the two bits is 1; otherwise it returns bit 0.



Byte


Lets you build a string byte by byte.



ByteLen


Returns the length of a string in bytes



ByteType


Determines the internal function of a particular byte.



ByteVal


Determines the value of a particular byte in a string.



Call


Calls a subroutine.



Case


Alters the sequence of execution in the program according to the value of an expression.



Cats


Concatenates two strings.



Change


Replaces one or more instances of a substring.



Char


Generates an ASCII character from its numeric code value.



Checksum


Returns a checksum value for a string.



CloseSeq


Closes a file after sequential processing.



Col1


Returns the character position preceding the substring specified in the most recently executed Field function.



Col2

Returns the character position following the substring specified in the most recently executed Field function.



Common


Defines a common storage area for variables.



Compare


Compares two strings. If NLS is enabled, the result of this function depends on the current locale setting of the Collate convention.



Convert


Replaces every instance of specified characters in a string with substitute characters.



Cos


returns the cosine of an angle. number is the number of degrees in the angle. Cos is the inverse of ACos.



CosH


returns the hyperbolic cosine of an angle. number is the number of degrees in the angle.



Count


Counts the number of times a substring occurs in a string.



CRC


Returns a 32-bit cyclic redundancy check value for a string.



CRC32


Returns a 32-bit cyclic redundancy check value for a string.



Date


Returns a date in its internal system format.



DCount


Counts delimited fields in a string.



Deffun


Defines a user-written function.



Dimension


Defines the dimensions of one or more arrays.



Div


Divides one number by another.



DownCase


Converts uppercase letters in a string to lowercase. If NLS is enabled, the result of this function depends on the current locale setting of the Ctype convention.



DQuote


Encloses a string in double quotation marks.



DSAttachJob


Attaches to a job in order to run it in job control sequence. A handle is returned which is used for addressing the job. There can only be one handle open for a particular job at any one time.



DSCheckRoutine


Checks if a BASIC routine is cataloged, either in the VOC as a callable item, or in the global catalog space.



DSDetachJob


This routine is used to give back a JobHandle acquired by DSAttachJob.



DSExecute


Executes a DOS or DataStage Engine command from a before/after subroutine.



DSGetJobInfo


Provides a method of obtaining information about a job, which can be used generally as well as for job control. It can refer to the current job or a controlled job, depending on the value of JobHandle.



DSGetLinkInfo


Provides a method of obtaining information about a link on an active stage, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.



DSGetLogEntry


This function is used to read the full event details given in EventId.



DSGetLogSummary


Returns a list of short log event details. The details returned are determined by the setting of some filters. (Care should be taken with the setting of the filters, otherwise a large amount of information can be returned.)



DSGetNewestLogId


This function is used to get the ID of the most recent log event in a particular category, or in any category.



DSGetParamInfo


This function provides a method of obtaining information about a parameter, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.



DSGetProjectInfo


Provides a method of obtaining information about the current project.



DSGetStageInfo


Provides a method of obtaining information about a stage, which can be used generally as well as for job control. It can refer to the current job, or a controlled job, depending on the value of JobHandle.



DSGetStageLinks


Returns a field mark delimited list containing the names of all of the input/output links of the specified stage.



DSLogEvent


This function is used to log an event message to a job other than the current one. (Use DSLogInfo, DSLogFatal, or DSLogWarn to log an event to the current job.)



DSLogFatal


Logs a fatal error message in a job's log file and aborts the job.



DSLogInfo


Logs an information message in a job's log file.



DSLogToController


This routine may be used to put an info message in the log file of the job controlling this job, if any. If there isn't one, the call is just ignored.



DSLogWarn


Logs a warning message in a job's log file.



DSMakeJobReport


Generates a string describing the complete status of a valid attached job.



DSMakeMsg


Insert arguments into a message template. Optionally, it will look up a template ID in the standard DataStage messages file, and use any returned message template instead of that given to the routine.



DSPrepareJob


Used to ensure that a compiled job is in the correct state to be run or validated.



DSRunJob


Used to start a job running. Note that this call is asynchronous; the request is passed to the run-time engine, but you are not informed of its progress.



DSSendMail


Used to send an email notification.



DSSetGenerateOpMetaData


Use this to specify whether the job generates operational meta data or not. This overrides the default setting for the project. In order to generate operational meta data the Process MetaBroker must be installed on your DataStage machine.



DSSetJobLimit


By default a controlled job inherits any row or warning limits from the controlling job. These can, however, be overridden using the DSSetJobLimit function.



DSSetParam


Used to specify job parameter values prior to running a job. Any parameter not set will be defaulted.



DSSetUserStatus


This routine applies only to the current job, and does not take a JobHandle parameter. It can be used by any job in either a JobControl or After routine to set a termination code for interrogation by another job. In fact, the code may be set at any point in the job.



DSStopJob


This routine should only be used after a DSRunJob has been issued. It immediately sends a Stop request to the run-time engine.



DSTransformError


Logs a warning message to a job log file. Called from transforms only.



DSTranslateCode


Converts a job control status or error code into an explanatory text message.



DSWaitForFile


Suspend a job until a named file either exists or does not exist.



DSWaitForJob


This function is only valid if the current job has issued a DSRunJob on the given JobHandle. It returns once the job started by that last DSRunJob call has finished.
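
Several of the DS* routines above are designed to be chained together from job control code. Here is a minimal, hypothetical sketch (the job name "LoadSales" and its "TargetDate" parameter are invented) of the usual attach, set parameters, run, wait, check and detach pattern:

hJob = DSAttachJob("LoadSales", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "TargetDate", "2011-07-31")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)

* DSRunJob is asynchronous, so wait for the job to finish
ErrCode = DSWaitForJob(hJob)

* Check how the run ended and log a message if it aborted
JobStatus = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If JobStatus = DSJS.RUNFAILED Then
   Call DSLogWarn("LoadSales aborted", "JobControl")
End

ErrCode = DSDetachJob(hJob)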



Dtx


Converts a decimal integer to hexadecimal.



Ebcdic


Converts the values of characters in a string from ASCII to EBCDIC format.



Else


Define several blocks of statements and the conditions that determine which block is executed. You can use a single line syntax or multiple lines in a block.



End


Indicates the end of a program, a subroutine, or a block of statements.



Equate


Equates a value to a symbol or a literal string during compilation.



Ereplace


Replaces one or more instances of a substring.



Exchange


Replaces a character in a string.



Exit


Define a program loop.



Exp


Returns the value of "e" raised to the specified power.



Field


Returns delimited substrings in a string.



FieldStore


Modifies character strings by inserting, deleting, or replacing fields separated by specified delimiters.



Fix


Use the FIX function to convert a numeric value to a floating-point number with a specified precision. FIX lets you control the accuracy of computation by eliminating excess or unreliable data from numeric results.



Fmt


Formats data for output.



FmtDP


In NLS mode, formats data in display positions rather than by character length.



Fold


Folds strings to create substrings.



FoldDP


In NLS mode, folds strings to create substrings using character display positions.



For


Create a For…Next program loop.



GetLocale


In NLS mode, retrieves the current locale setting for a specified category.



GoSub


Transfers program control to an internal subroutine.



GoTo


Transfers program control to the specified statement.



Iconv


Converts a string to an internal storage format.



If


Execute one or more statements conditionally. You can use a single line syntax or multiple lines in a block.



Index


Returns the starting position of a substring.



InMat


Retrieves the dimensions of an array, or determines if a Dim statement failed due to insufficient memory.



Int


Returns the integer portion of a numeric expression.



IsNull


Tests if a variable contains a null value.



Left


Extracts a substring from the start of a string.



Len


Returns the number of characters in a string.



LenDP


In NLS mode, returns the length of a string in display positions.



Ln


Calculates the natural logarithm of the value of an expression, using base "e".



Locate


Use a LOCATE statement to search dynamic.array for expression and to return a value indicating where it was found.



Loop


Define a program loop.



Mat


Assigns values to the elements of an array.



Match


Compares a string with a format pattern. If NLS is enabled, the result of a match operation depends on the current locale setting of the Ctype and Numeric conventions.



MatchField


Searches a string and returns the part of it that matches a pattern element.



Mod


Returns the remainder after a division operation.



Nap


Pauses a program for the specified number of milliseconds.



Neg


Returns the inverse of a number.



Next


Create a For…Next program loop.



Not


Inverts the logical result of an expression.



Null


Performs no action and generates no object code.



Num


Determines whether a string is numeric. If NLS is enabled, the result of this function depends on the current locale setting of the Numeric convention.



Oconv


Converts an expression to an output format.



On


Transfer program control to an internal subroutine.



OpenSeq


Opens a file for sequential processing.



Pwr


Raises the value of a number to the specified power.



Quote


Encloses a string in double quotation marks.



Randomize


Generates a repeatable sequence of random numbers in a specified range.



ReadSeq


Reads a line of data from a file opened for sequential processing.



Real


Use the REAL function to convert number into a floating-point number without loss of accuracy. If number evaluates to the null value, null is returned.



Repeat


Define a program loop.



Return


Ends a subroutine and returns control to the calling program or statement.



Right


Extracts a substring from the end of a string.



Rnd


Generates a random number.



Seq


Converts an ASCII character to its numeric code value.



SetLocale


In NLS mode, sets a locale for a specified category.



Sin


returns the sine of an angle. number is the number of degrees in the angle. Sin is the inverse of ASin.



SinH


returns the hyperbolic sine of an angle. number is the number of degrees in the angle.



Sleep


Pauses a program for the specified number of seconds.



Soundex


Generates codes that can be used to compare character strings based on how they sound.



Space


Returns a string containing the specified number of blank spaces.



Sqrt


Returns the square root of a number.



SQuote


Encloses a string in single quotation marks.



Status


Returns a code that provides information about how a preceding function was executed.



Str


Composes a string by repeating the input string the specified number of times.



Substring


Returns a substring of a string.



Substrings


Returns a substring of a string.



Tan


returns the tangent of an angle. number is the number of degrees in the angle. Tan is the inverse of ATan.



TanH


returns the hyperbolic tangent of an angle. number is the number of degrees in the angle.



Then


Execute one or more statements conditionally. You can use a single line syntax or multiple lines in a block.



Time


Returns the internal system time.



TimeDate


Returns the system time and date. If NLS is enabled, the result of this function depends on the current locale setting of the Time convention.



Trim


Trims unwanted characters from a string.



TrimB


Trims trailing spaces from a string.



TrimF


Trims leading spaces and tabs from a string.



UniChar


In NLS mode, generates a single character in Unicode format.



UniSeq


In NLS mode, converts a Unicode character to its equivalent decimal value.



Until


Define a program loop.



UpCase


Changes lowercase letters in a string to uppercase. If NLS is enabled, the result of this function depends on the current locale setting of the Ctype convention.



WEOFSeq


Writes an end-of-file mark in an open sequential file.



While


Define a program loop.



WriteSeq


Writes a new line to a file that is open for sequential processing and advances a pointer to the next position in the file.



WriteSeqF


Writes a new line to a file that is open for sequential processing, advances a pointer to the next position in the file, and saves the file to disk.



Xtd


Converts a hexadecimal string to decimal.