Sunday, July 31, 2011

17 Mistakes that ETL Designers make with Very Large Data

Again, Vincent McBurney has written another good article :D

17 Mistakes that ETL Designers make with Very Large Data
These are the mistakes that ETL designers can make when processing scary high data volumes.
Dan Lindstedt is a very large data guru and he has a series of outstanding blog posts on very large databases, the latest being ETL Engines: VLDW & Loading / Transforming.  I first came across Dan on ETL/ELT forums, where he has over 20,000 forum posts.  He popped up on B-Eye-Network blogs.  Dan has had no comments on his latest post series yet, as B-Eye doesn't have a reader-friendly design and that discourages reader participation.  For example, I just got three comments over a weekend on three old archived ETL posts.  ITToolbox is a friendly place for reader participation.
My favourite part of Dan's latest post is the 17 mistakes that ETL Designers make with very large data.  In fact that's the title I would have gone for!  For some reason blog titles with a number in them attract more hits.  I've shown Dan's list of 17 below with my own comments on how that impacts DataStage developers.  I would love to hear your own contributions for common ETL design mistakes.
1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
Agree 100%.  I believe in having at least three bands in your processing: extract from source to file, process the file into a load-ready dataset, and load the load-ready dataset.  For a standard target table load I would have an insert job, an update job, a delete job (if needed) and a bulk load job (for large volumes). 
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
I can see this working well for smaller volumes - and the combination of DataStage accessing data from multiple systems via a Federation Server plugin is intriguing, but this type of cross database joining would be nasty on very large volumes.  Pulling the smaller data volumes into target tables or lookup datasets would be faster.
3) Targeting more than 1 or 2 target tables
Agree 100%.  You could take it to two tables if there is a very close parent/child relationship like a master table and an attribute table, or a header table and a detail table.  But that's the exception, most large volume ETL jobs should be preparing data for one target table.
4) moving rows that are too wide through a single process
I don't know how you get around this one.  If you have a lot of columns then you gotta get it in there!
5) loading very large data sets to targets WITH INDEXES ON
DataStage makes this easy to manage: in a standard database stage (insert or bulk load actions) you can use the before-SQL tab to turn indexes off and the after-SQL tab to turn them back on.  The statements on those tabs are run just once for each job (and not per row).  You don't need indexes and keys if you have other ways to check your referential integrity.
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
Examples of why you need to be in the good books with your DBAs! 
"My DataStage job is running slow, I think there is something wrong with the database table."
"You sure it's not DataStage?"
"Can you check the table, maybe run a trace?"
"What is DataStage?"
"Could you just check the table, I've got the job running right now!"
"I ran a query on the table and it's running fine."
"Could you maybe get off Facebook and do a friggen trace!"
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
DataStage Parallel Jobs need a lot of RAM for the lookup stage.  The IBM Information Server Blade starts with 4G of RAM per blade for two dual core CPUs on each blade.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
I think the treblification of this one is an OS swapping joke.  I am going to be talking about 64 bit processing with the IBM architects when I meet them at IoD 2007 next month.  The Information Server can run on some 64 bit environments but it will be interesting to find out what plans IBM have for expanding this.
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
This is a tricky one - kind of like rows that are too wide - sometimes you need to do it and you hope the massively parallel architecture is up to it.  DataStage can make almost every type of stage work in parallel so it can get away with extra steps, however this does make the job harder to debug and sometimes you get those random out of resource errors...
14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false, multi-passing the data can actually improve performance by orders of magnitude. IF parallelism can be increased.
I'm not sure what Dan means by multi-passing and I'll ask in his comments thread.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
More detail about this is in Dan's post.  If you try to trap database rejects in a Server Job you use a reject link from a Transformer prior to the database stage.  Dan points out that each reject row slows down the job by a factor of 4 as the ETL job stops processing to handle that reject row. 
Parallel Jobs are more efficient as they use a reject link out of the database stage and on a parallel architecture can push the handling of those rows into a new process.  I haven't seen any overhead in this design if you don't get rejects, and database rejects should be under 0.01% of your rows or else your design is faulty. 
I've never tried to trap bulk load, MultiLoad, TPump or any other type of native database load errors back in DataStage, and I don't think the stages can accept them anyway, so I let the database handle them.  On smaller volume jobs I often use the database reject link and insert instead of bulk load for more robust auditing of rows. 
16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
A bit hard to avoid in a row-by-row ETL tool!  The parallel architecture and caching and memory sharing and a bunch of other things make it fast.
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
Fortunately this is very hard to do in a DataStage job or people would misuse it!  In DataStage looping on a per row basis can be done via lookups to database stored procedures or custom code modules but most standard DataStage stages do an action per row.

Think outside the box.
A lot of ETL sites have templates for building new jobs or design standards or example jobs.  These are excellent for 90% of your ETL work however very large data jobs may need custom designs.  You should look at what has been done in other jobs but also be ready to take a job into performance testing to try out dozens of other configurations.  As Dan says in his post - "performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines".
A great series of posts from Dan with some cool Oracle and SQL Server tips in earlier posts.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Friday, July 29, 2011

Using DataStage 8 Parameter Sets to Tame Environment Variables

Another good article from Vincent McBurney:


Using DataStage 8 Parameter Sets to Tame Environment Variables
Special Team Parameter Sets can remove some of the mystery from DataStage Parallel Job Environment Variables.
In a previous post I looked at How to Create, Use and Maintain DataStage 8 Parameter Sets.  In this second of three posts on Parameter Sets I look at combining Environment Variables with Parameter Sets, and in the final post I look at User Defined Environment Variables and Parameter Sets.
Parameter Sets have the potential to make Environment Variables much easier to add to jobs and easier to use across a large number of jobs.

Environment Variables and Parameter Sets
Environment Variables are set every time you log into a computer.  They are set for all Unix, Linux and Windows logins and you can see them if you type "env" (or "set" on Windows) at a prompt.  DataStage Parallel Jobs have a special set of Environment Variables that get added during a DataStage installation, and they are exposed through the DataStage Administrator so you can edit them more easily.  You can see most of them documented in Chapter 6 of the Parallel Job Advanced Developer's Guide.
There used to be just two ways to set an environment variable for a job. 
1)  In the Administrator tool you set it centrally and it impacts every job that runs; this variable gets set in the session that starts any DataStage job.
2)  In the DataStage Designer you add the same environment variable to a job via the Add Job Parameter screen and use it as an override just for that job.
Let's look at some of the problems with these environment variables prior to version 8:
  1. Neither of those screens is a very friendly user interface.  The parameters are in a long list, they have long and technical names, and it's hard to work out how different parameters relate to each other.
  2. You can bring in a DataStage guru who can spend weeks fine tuning your Environment Variable values for you in a performance testing environment - however it only takes one dunce to come along, use Administrator to change a setting, and lose all that value.
  3. It's very time consuming to add these environment variables to jobs.
  4. If you use Sequence Jobs you will find yourself having to pass values through from the Sequence job level to the parallel job level in the Job Activity stage properties for every single parameter, leading to lots of time spent configuring Sequence jobs (there is a sketch of what that per-parameter hand-off looks like under the covers just after this list).
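
A Sequence job compiles down to BASIC job control code that pushes each of those values into the parallel job before running it, so the hand-off ends up looking roughly like the sketch below.  The job name "HighVolumeLoad" and the config file path are invented for the example; environment variables that have been added to a job as parameters are set with their $ prefix:

hJob = DSAttachJob("HighVolumeLoad", DSJ.ERRFATAL)
* Override an environment variable that was added to the job as a parameter
ErrCode = DSSetParam(hJob, "$APT_CONFIG_FILE", "/opt/IBM/config/8node.apt")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
ErrCode = DSDetachJob(hJob)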

Parameter Sets change all this.  Imagine if you could add Environment Variables to a job by choosing from a shorter list, with a group of environment variables and a name that indicates what that group of variables is trying to achieve.
By creating some "special team" Parameter Sets and adding environment variables to them we simplify the creation and management of these values.  A DataStage parallel guru sets them up at the beginning of a project, they are performance tested to verify they work and then all developers who follow can benefit from using those Parameter Sets.  You need to recompiled the job if you add or remove a Parameter Set or a parameter from a Parameter Set but apart from that no changes to the job are necessary.
I have created some example Parameter Sets full of Environment Variables to illustrate how this works.  The first two scenarios show how to create a Parameter Set for very high and very low volume jobs.  This lets you setup your project wide variables to suit medium jobs or "all comers" and lets you override specific settings for the extremes of data volumes.

High Volume Job
The idea here is you choose a typical high volume job and test the hell out of it using all the DataStage reporting and performance monitoring software and then via trial and error you tune some environment variables in a Parameter Set to deliver faster performance.  You then apply that Parameter Set to all similar high volume jobs.
Testing will show whether you can use one Parameter Set for all high volume jobs or whether you need different Parameter Sets for different types of jobs - such as those that write to a file versus those that write to a database.
For high volume jobs the first environment variables to look at are:
  • $APT_CONFIG_FILE: lets you define the biggest config file with the largest number of nodes.
  • $APT_SCORE_DUMP: when switched on it creates a job run report that shows the partitioning used, degree of parallelism, data buffering and inserted operators.  Useful for finding out what your high volume job is doing.
  • $APT_PM_PLAYER_TIMING: this reporting option lets you see what each operator in a job is doing, especially how much data they are handling and how much CPU they are consuming.  Good for spotting bottlenecks.
One way to speed up very high volume jobs is to pre-sort the data and make sure it is not resorted in the DataStage job.  This is done by turning off auto sorting in high volume jobs:
  • APT_NO_SORT_INSERTION: stops the job from automatically adding a sort command to the start of a job that has stages that need sorted data, such as Remove Duplicates.  You can also add a Sort stage to the job and set it to "Previously Sorted" to avoid this in a specific job path.
Buffering is another thing that can be tweaked, it controls how data is passed between stages, usually you just leave it alone but on a very high volume job you might want custom settings:
  • APT_BUFFER_MAXIMUM_MEMORY: Sets the default value of Maximum memory buffer size.
  • APT_BUFFER_DISK_WRITE_INCREMENT: For systems where small to medium bursts of I/O are not desirable, the default 1 MB write-to-disk chunk size may be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this and can be set larger than 1048576 (1 MB). The setting may not exceed max_memory * 2/3.
  • APT_IO_MAXIMUM_OUTSTANDING: Sets the amount of memory, in bytes, allocated to a WebSphere DataStage job on every physical node for network communications. The default value is 2097152 (2MB). When you are executing many partitions on a single physical node, this number may need to be increased.
  • APT_FILE_EXPORT_BUFFER_SIZE: if your high volume jobs are writing to sequential files you may be overheating your file system, increasing the size of this value can deliver data to files in bigger chunks to combat long latency.
These are just some of the I/O and buffering settings.


Low Volume Job
By default a low volume job will tend to run slowly on a massively scalable DataStage server. 
There are far fewer environment variables to set, as low volume jobs don't need much special configuration.  Just make sure the job is not trying to partition data, as that could be overkill when you don't have a lot of data to process.  Partitioning and repartitioning data on volumes of less than 1000 rows makes the job start and stop more slowly:
  • APT_EXECUTION_MODE: By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode: ONE_PROCESS, MANY_PROCESS or NO_SERIALIZE.
  • $APT_CONFIG_FILE: lets you define a config file that will run these little jobs on just one node so they don't try any partitioning and repartitioning.
  • $APT_IO_MAXIMUM_OUTSTANDING: when a job starts on a node it is allocated some memory for network communications - especially the partitioning and repartitioning between nodes.  This is set to 2MB by default, but when you have a squadron of very small jobs that don't partition you can reduce this size to make the job start faster and free up RAM.
Other Parameter Sets
You can set up all your default project Environment Variables to handle all data volumes in between.  You can still have a Parameter Set for medium volume jobs if you have specific config files you want to use. 
You might also create a ParameterSet called PX_MANY_STAGES which is for any job that has dozens of stages in it regardless of data volumes.
  • APT_THIN_SCORE: Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.
This can be combined with the large volume Parameter Set in a job so you have extra configuration for high volume jobs with many stages.
You might also create a ParameterSet for a difficult type of source data file when default values don't work, eg. PX_MFRAME_DATA:
  • APT_EBCDIC_VERSION: Certain operators, including the import and export operators, support the "ebcdic" property specifying that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION variable indicates the specific EBCDIC character set to use.
  • APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL: When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.
SAS is another operator that has a lot of configurable environment variables because when you are reading or writing native SAS datasets or running a SAS transformation you are handing some of the control over to SAS - these environment variables configure this interaction:
  • APT_HASH_TO_SASHASH: can output data hashed using sashash - the hash algorithm used by SAS.
  • APT_SAS_ACCEPT_ERROR: When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for WebSphere DataStage to terminate the operator with an error.
  • APT_NO_SAS_TRANSFORMS: WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.
You can group all known debug parameters into a single debug Parameter Set to make it easier for support to find:
  • APT_SAS_DEBUG: Set this to set debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.  Don't put this into your SAS Parameter Set as the support team might not be able to find it or know it exists.
  • APT_SAS_DEBUG_IO: Set this to set input/output debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.
  • APT_SAS_SCHEMASOURCE_DUMP: When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.
So a new developer who is handed a high volume job does not need to know anything about environment variables, they just need to add the right ParameterSet to the job.  And if an experienced developer decides a new environment variable needs to be added to high volume jobs they just add it to the central ParameterSet and recompile all the jobs that use it.  The "Where Used" function will help identify those jobs.
ParameterSets and environment variables make a powerful combination.  ParameterSets can act as a layer that simplifies environment parameters and makes them easier to add to jobs.

Thursday, July 28, 2011

DS Command

These are some DS commands, but use them wisely, especially if you are an inexperienced user, because they can sometimes make your DataStage installation unstable.

DS.TOOLS Menu 
Option Text on Screen 
1 Report on project licenses 
2 Rebuild Repository indices 
3 Set up server-side tracing >> 
4 Check integrity of job files 
5 Administer processes/locks >> 
6 Adjust job tunable properties 

DS.TRACE Menu 
Option Text on Screen 
1 Query trace status 
2 Turn tracing on 
3 Turn tracing off 
4 List trace records 
5 Delete all trace records 

DS.PLADMIN Menu 
Option Text on Screen 
1 List all processes 
2 List state of a process 
3 List state of all processes in a job 
4 List all locks 
5 List locks held by a process 
6 List locks held by all processes in a job 
7 Clear locks held by a process 
8 Clear locks held by all processes in a job 
9 Logout a process 
10 Logout all processes in a job 
11 Clear status file for a job 


DS.TOOLS Menu 
Option Command 
1 DSR_LICENSE REPORT 
2 DS.REINDEX ALL 
3 DS.TRACE 
4 DS.CHECKER JOBS 
5 DS.PLADMIN 
6 DS.TUNEPROPS 

DS.TRACE Menu 
Option Command 
1 TRACE 
2 TRACE ON 
3 TRACE OFF 
4 TRACE LIST 
5 TRACE CLEAR 

DS.PLADMIN Menu 
Option Command 
1 DS.PLADMIN.CMD LIST PIDS * 
2 DS.PLADMIN.CMD LIST PIDS 
3 DS.PLADMIN.CMD LIST PIDS 
4 DS.PLADMIN.CMD LIST LOCKS * 
5 DS.PLADMIN.CMD LIST LOCKS 
6 DS.PLADMIN.CMD LIST LOCKS 
7 DS.PLADMIN.CMD CLEAR LOCKS 
8 DS.PLADMIN.CMD CLEAR LOCKS 
9 DS.PLADMIN.CMD CLEAR PIDS 
10 DS.PLADMIN.CMD CLEAR PIDS 
11 DS.PLADMIN.CMD CLEAR STATUS 


Error : DSR_ExecJob (Action=5) when running a job

I got the resolution for this issue from IBM Support:

Problem(Abstract)

In DataStage Director, the following errors occur when trying to view folders in a project:

Error calling subroutine: DSR_EXECJOB (Action=5); check DataStage is set up correctly in project <project name> (The connection is broken (81002))


Error calling subroutine: *DataStage*DSR_SELECT (Action=3); check DataStage is set up correctly in project <project name> (The connection is broken (81002))

Cause

The errors were received because the Logging Agent was not running.

Diagnosing the problem


  1. Check permissions on the project. Verify the user has RWX permissions on the files in the project.
  2. Verify the Agents (Logging Agent and ASB Agent) are running.

Resolving the problem

To start the Agents, restart the Node Agents on the engine tier (for example with the NodeAgents.sh script under ASBNode/bin on Unix, or by starting the ASB Agent and Logging Agent services on Windows) and then retry the operation.

Wednesday, July 27, 2011

How to solve Error:No conductor nodes found in export node pool

If you find this error when you are writing to a Sequential File stage, it usually happens when the conductor node is in a different pool from the other (processing) nodes, or when APT_PM_CONDUCTOR_NODENAME has been set manually.

To solve this problem, you may try adding the machine name before the file path. For example:
<Machine_Name>:<Filename Path>
e.g. ETL_Machine_1:/my_etl_source/output/my_dest_filename.txt


Sunday, July 24, 2011

10 Ways to Make DataStage Run Slower


I've got a good article from one of my favourite DataStage gurus, Vincent McBurney. It's hilarious but it really gets the point across and should be remembered by all DataStage developers:

Vincent said:
"Everyone wants to tell you how to make your ETL jobs run faster, well here is how to make them slower!
The Structured Data blog has posted a list, Top Ways How Not To Scale Your Data Warehouse, that is a great chat about bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?

1. Use the same configuration file for all your jobs.

You may have two nodes configured for each CPU on your DataStage server and this allows your high volume jobs to run quickly but this works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike, they take a lot of time to crank up to full speed but once you are there they are lightning fast. If you are processing a handful of rows the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes and a squadron of these jobs will slow down your entire DataStage batch run!


2. Use a sparse database lookup on high volumes.

This is a great way to slow down any ETL tool, it works on server jobs or parallel jobs. The main difference is that server jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, and for every row processed it will send a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!


3. Keep resorting your data.

Sorting is the Achilles heel of just about any ETL tool, the average ETL job is like a busy restaurant, it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food. You need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement but sorting it again anyway!


4. Design single threaded bottlenecks

This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality. A script FTP of a file that keeps an entire DataStage Parallel engine waiting. A bulk database load via a single update stream. Reading a large sequential file from a parallel job without using multiple readers per node.


5. Turn on debugging and forget that it's on

In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!


6. Let the disks look after themselves

Never look at what is happening on your disk I/O - that's a Pandora's Box of better performance! You can get some beautiful drag and slow down by ignoring your disk I/O as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times.


7. Keep Writing that Data to Disk

Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, it can give you a modular job design. It can also slow down Parallel Jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a stupid ASCII metadata dumb file and then import and repartition to pick it up and process it again. Sequential files are the Forrest Gump of file storage, simple and practical but dumb as all hell. It costs time to write to them and time to read and parse them, so designing an end to end process that writes data to sequential files repeatedly will give you massive slow down times.


8. Validate every field

A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before being processed by DataStage. One way to slow down your job and slow down your ETL development and testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion - such as numbers in VARCHAR fields and dates in string fields and packed fields, but most don't. Adding a layer of validation you don't need should slow those jobs down.


9. Write extra steps in database code

The same phrase gets uttered on many an ETL project. "I can write that in SQL", or "I can write that in Java", or "I can do that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a sure fire way to end up with a step in the end to end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".


10. Don't do Performance Testing

Do not take your highest volume jobs into performance testing, just keep the default settings, default partitioning and your first draft design and throw that into production and get the hell out of there.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way."
Thanks a lot Vincent, Sandy

How to check Runtime Information for performance tuning


I've got this good article from the IBM Information Center:

Runtime information

When you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow. This information is written to the job log when the job is run.
An example output is:
##I TFPM 000324 08:59:32(004) <generator,0> Calling runLocally: step=1, node=rh73dev04, op=0, ptn=0
##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk  elapsed: 0.04  user: 0.00  sys: 0.00  suser: 0.09  ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(006) <peek,0> Calling runLocally: step=1, node=rh73dev04, op=1, ptn=0
##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk  elapsed: 0.01  user: 0.00  sys: 0.00  suser: 0.09  ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(013) <peek,1> Calling runLocally: step=1, node=rh73dev04a, op=1, ptn=1
##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk  elapsed: 0.00  user: 0.00  sys: 0.00  suser: 0.09  ssys: 0.02 (total CPU: 0.11)
This output shows that each partition of each operator has consumed about one tenth of a second of CPU time during its runtime portion. In a real world flow, we'd see many operators, and many partitions.
It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than others, it might mean the data is partitioned in an unbalanced way, and that repartitioning, or choosing different partitioning keys might be a useful strategy.
If one operator is using a much larger portion of the CPU than others, it might be an indication that you've discovered a problem in your flow. Common sense is generally required here; a sort is going to use dramatically more CPU time than a copy. This will, however, give you a sense of which operators are the CPU hogs, and when combined with other metrics presented in this document can be very enlightening.
Setting the environment variable APT_DISABLE_COMBINATION might be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so this should be done with care.
Unlike the job monitor cpu percentages, setting APT_PM_PLAYER_TIMING will provide timings on every operator within the flow.

How to debug your DataStage Parallel Jobs

Here is a good article with tips for debugging a DataStage job:



Steps

Enable the following environment variables in DataStage Administrator:
  • APT_PM_PLAYER_TIMING – shows how much CPU time each stage uses
  • APT_PM_SHOW_PIDS – show process ID of each stage
  • APT_RECORD_COUNTS – shows record counts in log
  • APT_CONFIG_FILE – switch configuration file (one node, multiple nodes)
  • OSH_DUMP – shows OSH code for your job. Shows if any unexpected settings were set by the GUI.
  • APT_DUMP_SCORE – shows all processes and inserted operators in your job
  • APT_DISABLE_COMBINATION – do not combine multiple stages into one process. Disabling operator combination makes it easier to see where your errors are occurring.
  • Use a Copy stage to dump out data to intermediate peek stages or sequential debug files. Copy stages get removed during compile time so they do not increase overhead.
  • Use row generator stage to generate sample data.
  • Look at the phantom files for additional error messages: c:\datastage\project_folder\&PH&
  • To catch partitioning problems, run your job with a single node configuration file and compare the output with your multi-node run. You can just look at the file size, or sort the data for a more detailed comparison (Unix sort + diff commands).
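
One more debugging aid that is not on the list above but goes hand in hand with it: a small before-job subroutine that dumps the job's parameter values into the log, so that when a run goes wrong you can see exactly which config file and settings it picked up. A minimal sketch (the routine name is made up, and it uses the documented DSGetJobInfo/DSGetParamInfo calls):

Subroutine DumpJobParams(InputArg, ErrorCode)
   ErrorCode = 0
   * DSJ.ME refers to the job this routine is running in
   ParamList = DSGetJobInfo(DSJ.ME, DSJ.PARAMLIST)
   NumParams = DCount(ParamList, ",")
   For i = 1 To NumParams
      ParamName = Field(ParamList, ",", i)
      ParamValue = DSGetParamInfo(DSJ.ME, ParamName, DSJ.PARAMVALUE)
      Call DSLogInfo(ParamName : " = " : ParamValue, "DumpJobParams")
   Next i
Return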

Saturday, July 23, 2011

How does Hash Partition works in Datastage | Datastage Tutorials-Datastage ETL Tool

You can read the full article at:
How does Hash Partition works in Datastage | Datastage Tutorials-Datastage ETL Tool

Happy learning,
Sandy

How to release a Read Only Job?

In the Administrator client, open the project's Command window and run:
UPDATE DS_JOBOBJECTS SET READONLY = 'NRO' WHERE READONLY = 'RO'

Rob Vonk's Function : GetXMLProperty

Source: Rob Vonk's GetXMLProperty Function

Name: GetXMLProperty
Author: Rob Vonk
Language: DataStage BASIC
Description:
Gets a property value from an XML fragment. I use it to get data from DataStage XML exports.
Snippet:
Function GetXMLProperty(XMLPart, PropertyName)

* Default to an empty string if the property is not found
Result = ""

* Build the opening tag we are looking for, e.g. <Property Name="Description">
Property = '<Property Name="' : PropertyName : '">'

* Case-insensitive search: the MCU conversion uppercases both strings before Index
P = Index(Iconv(XMLPart,"MCU"),Iconv(Property,"MCU"),1)

If P > 0 then
  * Take everything after the opening tag...
  TempXML = XMLPart[P + Len(Property),Len(XMLPart)]
  * ...and cut it off at the next "<", which starts the closing tag
  P = Index(TempXML,'<',1)
  If P > 0 then
     TempXML = TempXML[1,P-1]
     Result = TempXML
  end
end

* Ans is the value returned by a DataStage transform function
Ans = Result
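
For anyone who has not used it before, here is a minimal sketch of calling the function from another routine or from job control code. It assumes the function has been saved as a DataStage transform function called GetXMLProperty (hence the DSU. prefix in the Deffun); the XML fragment and property name are just made-up examples:

Deffun GetXMLProperty(XMLPart, PropertyName) Calling "DSU.GetXMLProperty"

* A made-up fragment like the ones found in a DataStage XML export
XMLText = '<Property Name="Description">Loads the daily sales feed</Property>'

Desc = GetXMLProperty(XMLText, "Description")
Call DSLogInfo("Description = " : Desc, "GetXMLPropertyDemo")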

DataStage Functions and Routines

Rob Vonk's article about DataStage Functions and Routines:




$Define


Defines identifiers that control program compilation or supplies replacement text for an identifier.



$IfDef


Tests an identifier to see if it is defined or not defined.



$IfNDef


Tests an identifier to see if it is defined or not defined.



$Include


Inserts source code contained in a separate file and compiles it along with the main program.



$Undefine


Removes an identifier that was set using the $Define statement. If no identifier is set, $Undefine has no effect.



Abs


Returns the absolute (unsigned) value of a number.



ACos


returns the arc-cosine of number in degrees. ACos is the inverse of Cos.



Alpha


Checks if a string is alphabetic. If NLS is enabled, the result of this function is dependent on the current locale setting of the Ctype convention.



Ascii


Converts the values of characters in a string from EBCDIC to ASCII.



ASin


returns the arc-sine of number in degrees. ASin is the inverse of Sin.



ATan


returns the arc-tangent of number in degrees. ATan is the inverse of Tan.



BitAnd


compares two integers bit by bit. For each bit, it returns bit 1 if both bits are 1; otherwise it returns bit 0.



BitNot


inverts the bits in an integer, that is, changes bit 1 to bit 0, and vice versa. If bit.number is specified, that bit is inverted; otherwise all bits are inverted.



BitOr


compares two integers bit by bit. For each bit, it returns bit 1, if either or both bits is 1; otherwise it returns bit 0.



BitReset


resets the specified bit to 0. If it is already 0, it is not changed.



BitSet


sets the specified bit to 1. If it is already 1, it is not changed.



BitTest


tests if the specified bit is set. It returns 1 if the bit is set; 0 if it is not.



BitXOr


compares two integers bit by bit. For each bit, it returns bit 1 if only one of the two bits is 1; otherwise it returns bit 0.



Byte


Lets you build a string byte by byte.



ByteLen


Returns the length of a string in bytes



ByteType


Determines the internal function of a particular byte.



ByteVal


Determines the value of a particular byte in a string.



Call


Calls a subroutine.



Case


Alters the sequence of execution in the program according to the value of an expression.



Cats


Concatenates two strings.



Change


Replaces one or more instances of a substring.



Char


Generates an ASCII character from its numeric code value.



Checksum


Returns a checksum value for a string.



CloseSeq


Closes a file after sequential processing.



Col1


Returns the character position preceding the substring specified in the most recently executed Field function.



Col2

Returns the character position following the substring specified in the most recently executed Field function.



Common


Defines a common storage area for variables.



Compare


Compares two strings. If NLS is enabled, the result of this function depends on the current locale setting of the Collate convention.



Convert


Replaces every instance of specified characters in a string with substitute characters.



Cos


returns the cosine of an angle. number is the number of degrees in the angle. Cos is the inverse of ACos.



CosH


returns the hyperbolic cosine of an angle. number is the number of degrees in the angle.



Count


Counts the number of times a substring occurs in a string.



CRC


Returns a 32-bit cyclic redundancy check value for a string.



CRC32


Returns a 32-bit cyclic redundancy check value for a string.



Date


Returns a date in its internal system format.



DCount


Counts delimited fields in a string.



Deffun


Defines a user-written function.



Dimension


Defines the dimensions of one or more arrays.



Div


Divides one number by another.



DownCase


Converts uppercase letters in a string to lowercase. If NLS is enabled, the result of this function depends on the current locale setting of the Ctype convention.



DQuote


Encloses a string in double quotation marks.



DSAttachJob


Attaches to a job in order to run it in job control sequence. A handle is returned which is used for addressing the job. There can only be one handle open for a particular job at any one time.



DSCheckRoutine


Checks if a BASIC routine is cataloged, either in the VOC as a callable item, or in the global catalog space.



DSDetachJob


This routine is used to give back a JobHandle acquired by DSAttachJob.



DSExecute


Executes a DOS or DataStage Engine command from a before/after subroutine.



DSGetJobInfo


Provides a method of obtaining information about a job, which can be used generally as well as for job control. It can refer to the current job or a controlled job, depending on the value of JobHandle.



DSGetLinkInfo


Provides a method of obtaining information about a link on an active stage, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.



DSGetLogEntry


This function is used to read the full event details given in EventId.



DSGetLogSummary


Returns a list of short log event details. The details returned are determined by the setting of some filters. (Care should be taken with the setting of the filters, otherwise a large amount of information can be returned.)



DSGetNewestLogId


This function is used to get the ID of the most recent log event in a particular category, or in any category.



DSGetParamInfo


This function provides a method of obtaining information about a parameter, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.



DSGetProjectInfo


Provides a method of obtaining information about the current project.



DSGetStageInfo


Provides a method of obtaining information about a stage, which can be used generally as well as for job control. It can refer to the current job, or a controlled job, depending on the value of JobHandle.



DSGetStageLinks


Returns a field mark delimited list containing the names of all of the input/output links of the specified stage.



DSLogEvent


This function is used to log an event message to a job other than the current one. (Use DSLogInfo, DSLogFatal, or DSLogWarn to log an event to the current job.)



DSLogFatal


Logs a fatal error message in a job's log file and aborts the job.



DSLogInfo


Logs an information message in a job's log file.



DSLogToController


This routine may be used to put an info message in the log file of the job controlling this job, if any. If there isn't one, the call is just ignored.



DSLogWarn


Logs a warning message in a job's log file.



DSMakeJobReport


Generates a string describing the complete status of a valid attached job.



DSMakeMsg


Insert arguments into a message template. Optionally, it will look up a template ID in the standard DataStage messages file, and use any returned message template instead of that given to the routine.



DSPrepareJob


Used to ensure that a compiled job is in the correct state to be run or validated.



DSRunJob


Used to start a job running. Note that this call is asynchronous; the request is passed to the run-time engine, but you are not informed of its progress.



DSSendMail


Used to send an email notification.



DSSetGenerateOpMetaData


Use this to specify whether the job generates operational meta data or not. This overrides the default setting for the project. In order to generate operational meta data the Process MetaBroker must be installed on your DataStage machine.



DSSetJobLimit


By default a controlled job inherits any row or warning limits from the controlling job. These can, however, be overridden using the DSSetJobLimit function.



DSSetParam


Used to specify job parameter values prior to running a job. Any parameter not set will be defaulted.



DSSetUserStatus


This routine applies only to the current job, and does not take a JobHandle parameter. It can be used by any job in either a JobControl or After routine to set a termination code for interrogation by another job. In fact, the code may be set at any point in the job.



DSStopJob


This routine should only be used after a DSRunJob has been issued. It immediately sends a Stop request to the run-time engine.



DSTransformError


Logs a warning message to a job log file. Called from transforms only.



DSTranslateCode


Converts a job control status or error code into an explanatory text message.



DSWaitForFile


Suspend a job until a named file either exists or does not exist.



DSWaitForJob


This function is only valid if the current job has issued a DSRunJob on the given JobHandle. It returns once the job started by that last DSRunJob call has finished.
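
Several of the DS* routines above are designed to be chained together from job control code. Here is a minimal, hypothetical sketch (the job name "LoadSales" and its "TargetDate" parameter are invented) of the usual attach, set parameters, run, wait, check and detach pattern:

hJob = DSAttachJob("LoadSales", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "TargetDate", "2011-07-31")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)

* DSRunJob is asynchronous, so wait for the job to finish
ErrCode = DSWaitForJob(hJob)

* Check how the run ended and log a message if it aborted
JobStatus = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If JobStatus = DSJS.RUNFAILED Then
   Call DSLogWarn("LoadSales aborted", "JobControl")
End

ErrCode = DSDetachJob(hJob)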



Dtx


Converts a decimal integer to hexadecimal.



Ebcdic


Converts the values of characters in a string from ASCII to EBCDIC format.



Else


Define several blocks of statements and the conditions that determine which block is executed. You can use a single line syntax or multiple lines in a block.



End


Indicates the end of a program, a subroutine, or a block of statements.



Equate


Equates a value to a symbol or a literal string during compilation.



Ereplace


Replaces one or more instances of a substring.



Exchange


Replaces a character in a string.



Exit


Define a program loop.



Exp


Returns the value of "e" raised to the specified power.



Field


Returns delimited substrings in a string.



FieldStore


Modifies character strings by inserting, deleting, or replacing fields separated by specified delimiters.



Fix


Use the FIX function to convert a numeric value to a floating-point number with a specified precision. FIX lets you control the accuracy of computation by eliminating excess or unreliable data from numeric results.



Fmt


Formats data for output.



FmtDP


In NLS mode, formats data in display positions rather than by character length.



Fold


Folds strings to create substrings.



FoldDP


In NLS mode, folds strings to create substrings using character display positions.



For


Create a For…Next program loop.



GetLocale


In NLS mode, retrieves the current locale setting for a specified category.



GoSub


Transfers program control to an internal subroutine.



GoTo


Transfers program control to the specified statement.



Iconv


Converts a string to an internal storage format.



If


Execute one or more statements conditionally. You can use a single line syntax or multiple lines in a block.



Index


Returns the starting position of a substring.



InMat


Retrieves the dimensions of an array, or determines if a Dim statement failed due to insufficient memory.



Int


Returns the integer portion of a numeric expression.



IsNull


Tests if a variable contains a null value.



Left


Extracts a substring from the start of a string.



Len


Returns the number of characters in a string.



LenDP


In NLS mode, returns the length of a string in display positions.



Ln


Calculates the natural logarithm of the value of an expression, using base "e".



Locate


Use a LOCATE statement to search dynamic.array for expression and to return a value indicating where it was found.



Loop


Define a program loop.



Mat


Assigns values to the elements of an array.



Match


Compares a string with a format pattern. If NLS is enabled, the result of a match operation depends on the current locale setting of the Ctype and Numeric conventions.



MatchField


Searches a string and returns the part of it that matches a pattern element.



Mod


Returns the remainder after a division operation.



Nap


Pauses a program for the specified number of milliseconds.



Neg


Returns the inverse of a number.



Next


Create a For…Next program loop.



Not


Inverts the logical result of an expression.



Null


Performs no action and generates no object code.



Num


Determines whether a string is numeric. If NLS is enabled, the result of this function depends on the current locale setting of the Numeric convention.



Oconv


Converts an expression to an output format.



On


Transfer program control to an internal subroutine.



OpenSeq


Opens a file for sequential processing.



Pwr


Raises the value of a number to the specified power.



Quote


Encloses a string in double quotation marks.



Randomize


Generates a repeatable sequence of random numbers in a specified range.



ReadSeq


Reads a line of data from a file opened for sequential processing.



Real


Use the REAL function to convert number into a floating-point number without loss of accuracy. If number evaluates to the null value, null is returned.



Repeat


Define a program loop.



Return


Ends a subroutine and returns control to the calling program or statement.



Right


Extracts a substring from the end of a string.



Rnd


Generates a random number.



Seq


Converts an ASCII character to its numeric code value.



SetLocale


In NLS mode, sets a locale for a specified category.



Sin


returns the sine of an angle. number is the number of degrees in the angle. Sin is the inverse of ASin.



SinH


returns the hyperbolic sine of an angle. number is the number of degrees in the angle.



Sleep


Pauses a program for the specified number of seconds.



Soundex


Generates codes that can be used to compare character strings based on how they sound.



Space


Returns a string containing the specified number of blank spaces.



Sqrt


Returns the square root of a number.



SQuote


Encloses a string in single quotation marks.



Status


Returns a code that provides information about how a preceding function was executed.



Str


Composes a string by repeating the input string the specified number of times.



Substring


Returns a substring of a string.



Substrings


Returns a substring of a string.



Tan


returns the tangent of an angle. number is the number of degrees in the angle. Tan is the inverse of ATan.



TanH


returns the hyperbolic tangent of an angle. number is the number of degrees in the angle.



Then


Execute one or more statements conditionally. You can use a single line syntax or multiple lines in a block.



Time


Returns the internal system time.



TimeDate


Returns the system time and date. If NLS is enabled, the result of this function depends on the current locale setting of the Time convention.



Trim


Trims unwanted characters from a string.



TrimB


Trims trailing spaces from a string.



TrimF


Trims leading spaces and tabs from a string.



UniChar


In NLS mode, generates a single character in Unicode format.



UniSeq


In NLS mode, converts a Unicode character to its equivalent decimal value.



Until


Define a program loop.



UpCase


Changes lowercase letters in a string to uppercase. If NLS is enabled, the result of this function depends on the current locale setting of the Ctype convention.



WEOFSeq


Writes an end-of-file mark in an open sequential file.



While


Define a program loop.



WriteSeq


Writes a new line to a file that is open for sequential processing and advances a pointer to the next position in the file.



WriteSeqF


Writes a new line to a file that is open for sequential processing, advances a pointer to the next position in the file, and saves the file to disk.



Xtd


Converts a hexadecimal string to decimal.