Another good article from Vincent McBurney:
Using DataStage 8 Parameter Sets to Tame Environment Variables
Special Team Parameter Sets can remove some of the mystery from DataStage Parallel Job Environment Variables.
In a previous post I looked at How to Create, Use and Maintain DataStage 8 Parameter Sets. In this second of three posts on Parameter Sets I look at combining Environment Variables with Parameter Sets, and in the final post I look at User Defined Environment Variables and Parameter Sets.
Parameter Sets have the potential to make Environment Variables much easier to add to jobs and easier to use across a large number of jobs.
Environment Variables and Parameter Sets
Environment Variables are set every time you log into a computer. They are set for all Unix, Linux and Windows logins and you can see them if you type "env" at a Unix or Linux prompt, or "set" at a Windows prompt. DataStage Parallel Jobs have a special set of Environment Variables that get added during a DataStage installation, and they are exposed through the DataStage Administrator so you can edit them more easily. You can see most of them documented in Chapter 6 of the Parallel Job Advanced Developers Guide.
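As a quick illustration of the point above, here is a minimal Python sketch that lists whichever parallel engine variables happen to be set in the current session. It assumes the variables of interest share the APT_ prefix used throughout this post. The helper name apt_variables is made up for the example and is not part of DataStage.

```python
import os


def apt_variables(environ=None, prefix="APT_"):
    """Return the variables whose names start with prefix, ordered by name.

    environ defaults to the environment of the current process; a plain
    dict can be passed instead, which is handy for testing.
    """
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in sorted(environ) if name.startswith(prefix)}


if __name__ == "__main__":
    # Print the matching variables in the same NAME=value form that "env" uses.
    for name, value in apt_variables().items():
        print(f"{name}={value}")
```

Run from a session that has sourced the DataStage environment, this should show the centrally set values. From an ordinary login it will most likely print nothing.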
There used to be just two ways to set an environment variable for a job.
1) In the Administrator tool you set it centrally and that impacts every job that runs. The variable gets set in the session that starts any DataStage job:
2) In the DataStage Designer you add the same environment variable to a job via the Add Job Parameter screen and use it like an override just for that job:
Let's look at some of the problems with these environment variables prior to version 8:
- Neither of those screenshots above shows a very friendly user interface. The parameters are in a long list, they have long and technical names, and it's hard to work out how different parameters relate to each other.
- You can bring in a DataStage guru who can spend weeks fine tuning your Environment Variable values for you in a performance testing environment - however it only takes one dunce to come along and use Administrator to change a setting and lose all that value.
- It's very time consuming to add these environment variables to jobs.
- If you use Sequence Jobs you will find yourself having to pass values through from the Sequence job level to the parallel job level in the Job Activity stage properties for every single parameter, which means a lot of time spent configuring Sequence jobs.
Parameter Sets change all this. Imagine if you could add Environment Variables to a job choosing from a shorter list with a group of environment variables and a name that indicates what that group of variables is trying to achieve:
By creating some "special team" Parameter Sets and adding environment variables to them we simplify the creation and management of these values. A DataStage parallel guru sets them up at the beginning of a project, they are performance tested to verify they work and then all developers who follow can benefit from using those Parameter Sets. You need to recompile the job if you add or remove a Parameter Set or a parameter from a Parameter Set but apart from that no changes to the job are necessary.
I have created some example Parameter Sets full of Environment Variables to illustrate how this works. The first two scenarios show how to create a Parameter Set for very high and very low volume jobs. This lets you set up your project wide variables to suit medium jobs or "all comers" and lets you override specific settings for the extremes of data volumes.
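The override idea can be sketched in a few lines of Python. This is only a model of the layering, not a description of how DataStage stores or resolves anything. The Parameter Set names PX_HIGH_VOLUME and PX_LOW_VOLUME, the file paths and the values are all hypothetical.

```python
# Project wide values, as they might be set in Administrator for "all comers".
# Every name, path and value below is an illustrative assumption.
PROJECT_DEFAULTS = {
    "APT_CONFIG_FILE": "/opt/configs/medium_4node.apt",
    "APT_IO_MAXIMUM_OUTSTANDING": "2097152",
}

# Each Parameter Set carries only the variables it wants to override.
PARAMETER_SETS = {
    "PX_HIGH_VOLUME": {"APT_CONFIG_FILE": "/opt/configs/large_16node.apt"},
    "PX_LOW_VOLUME": {
        "APT_CONFIG_FILE": "/opt/configs/small_1node.apt",
        "APT_IO_MAXIMUM_OUTSTANDING": "1048576",
    },
}


def effective_environment(parameter_set=None):
    """Start from the project defaults and let one Parameter Set override them."""
    env = dict(PROJECT_DEFAULTS)
    if parameter_set is not None:
        env.update(PARAMETER_SETS[parameter_set])
    return env
```

A medium job would call effective_environment() and get the project values untouched, while a job tagged with PX_LOW_VOLUME would pick up the smaller config file and network buffer.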
High Volume Job
The idea here is you choose a typical high volume job and test the hell out of it using all the DataStage reporting and performance monitoring software and then via trial and error you tune some environment variables in a Parameter Set to deliver faster performance. You then apply that Parameter Set to all similar high volume jobs.
Testing will show whether you can use one Parameter Set for all high volume jobs or whether you need different Parameter Sets for different types of jobs - such as those that write to file versus those that write to a database.
For high volume jobs the first environment variables to look at are:
- $APT_CONFIG_FILE: lets you point the job at your biggest config file, the one with the largest number of nodes.
- $APT_DUMP_SCORE: when switched on it creates a job run report that shows the partitioning used, degree of parallelism, data buffering and inserted operators. Useful for finding out what your high volume job is doing.
- $APT_PM_PLAYER_TIMING: this reporting option lets you see what each operator in a job is doing, especially how much data they are handling and how much CPU they are consuming. Good for spotting bottlenecks.
One way to speed up very high volume jobs is to pre-sort the data and make sure it is not resorted in the DataStage job. This is done by turning off auto sorting in high volume jobs:
- APT_NO_SORT_INSERTION: stops the job from automatically adding a sort command to the start of a job that has stages that need sorted data such as Remove Duplicates. You can also add a sort stage to the job and set it to a value of "Previously Sorted" to avoid this in a specific job path.
Buffering is another thing that can be tweaked. It controls how data is passed between stages. Usually you just leave it alone but on a very high volume job you might want custom settings:
- APT_BUFFER_MAXIMUM_MEMORY: Sets the default value of Maximum memory buffer size.
- APT_BUFFER_DISK_WRITE_INCREMENT: For systems where small to medium bursts of I/O are not desirable, the default 1MB write-to-disk chunk size may be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this and can be set larger than 1048576 (1 MB). The setting may not exceed max_memory * 2/3.
- APT_IO_MAXIMUM_OUTSTANDING: Sets the amount of memory, in bytes, allocated to a WebSphere DataStage job on every physical node for network communications. The default value is 2097152 (2MB). When you are executing many partitions on a single physical node, this number may need to be increased.
- APT_FILE_EXPORT_BUFFER_SIZE: if your high volume jobs are writing to sequential files you may be overheating your file system. Increasing this value delivers data to files in bigger chunks to combat long latency.
These are just some of the I/O and buffering settings.
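The one hard rule quoted in the list above, that the disk write increment may not exceed max_memory * 2/3, is easy to check before a Parameter Set goes anywhere near a job. The sketch below assumes both values are given in bytes. The function name and the sample numbers are hypothetical and are not tuning advice.

```python
def disk_write_increment_is_valid(max_memory_bytes, disk_write_increment_bytes):
    """Return True if the increment respects the max_memory * 2/3 limit.

    The comparison is done as 3 * increment <= 2 * max_memory so that it
    stays in integer arithmetic and avoids floating point rounding.
    """
    return 3 * disk_write_increment_bytes <= 2 * max_memory_bytes


# Illustrative values for a high volume Parameter Set (assumptions only).
HIGH_VOLUME_BUFFERING = {
    "APT_BUFFER_MAXIMUM_MEMORY": 6 * 1024 * 1024,        # 6 MB
    "APT_BUFFER_DISK_WRITE_INCREMENT": 4 * 1024 * 1024,  # 4 MB, exactly at the limit
}

if __name__ == "__main__":
    ok = disk_write_increment_is_valid(
        HIGH_VOLUME_BUFFERING["APT_BUFFER_MAXIMUM_MEMORY"],
        HIGH_VOLUME_BUFFERING["APT_BUFFER_DISK_WRITE_INCREMENT"],
    )
    print("buffer settings valid:", ok)
```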
Low Volume Job
By default a low volume job will tend to run slowly on a massively scalable DataStage server.
There are far fewer environment variables to set here, as low volume jobs don't need much special configuration. Just make sure the job is not trying to partition data, as that could be overkill when you don't have a lot of data to process. Partitioning and repartitioning data on volumes of less than 1000 rows makes the job start and stop more slowly:
- APT_EXECUTION_MODE: By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode: ONE_PROCESS, MANY_PROCESS or NO_SERIALIZE.
- $APT_CONFIG_FILE: lets you define a config file that will run these little jobs on just one node so they don't try any partitioning and repartitioning.
- $APT_IO_MAXIMUM_OUTSTANDING: when a job starts on a node it is allocated some memory for network communications - especially the partitioning and repartitioning between nodes. This defaults to 2MB but when you have a squadron of very small jobs that don't partition you can reduce this size to make the jobs start faster and free up RAM.
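For the one node config file mentioned in the list above, here is a small Python sketch that writes out the text of a parallel configuration file with a chosen number of nodes. The layout follows the usual node / fastname / pools / resource structure as I remember it, so check it against a config file from your own installation before relying on it. The host name and directory paths are placeholders.

```python
def render_config(node_count, fastname, disk, scratchdisk):
    """Build the text of a parallel configuration file with node_count nodes.

    All nodes share one host and the same disk and scratch directories,
    which is the simplest possible layout.
    """
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines += [
            f'\tnode "node{i}"',
            "\t{",
            f'\t\tfastname "{fastname}"',
            '\t\tpools ""',
            f'\t\tresource disk "{disk}" {{pools ""}}',
            f'\t\tresource scratchdisk "{scratchdisk}" {{pools ""}}',
            "\t}",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host and paths; a low volume Parameter Set would point
    # $APT_CONFIG_FILE at wherever this text is saved.
    print(render_config(1, "etlhost", "/data/ds/datasets", "/data/ds/scratch"), end="")
```

Calling the same function with a larger node_count gives the matching big config file for the high volume Parameter Set, so both extremes come from one template.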
Other Parameter Sets
You can set up all your default project Environment Variables to handle all data volumes in between. You can still have a Parameter Set for medium volume jobs if you have specific config files you want to use.
You might also create a ParameterSet called PX_MANY_STAGES which is for any job that has dozens of stages in it regardless of data volumes.
- APT_THIN_SCORE: Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.
This can be combined with the large volume Parameter Set in a job so you have extra configuration for high volume jobs with many stages.
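When two Parameter Sets are attached to the same job it is worth knowing up front whether they both try to set the same variable. The Python sketch below is a hypothetical pre-check, not a DataStage feature: it merges several sets in order and reports any variable that appears with different values. How DataStage itself handles such an overlap is not covered here, so treat a reported conflict as something to resolve by hand.

```python
def combine_parameter_sets(*parameter_sets):
    """Merge dicts of variables in order and list the names that clash.

    Returns (combined, conflicts). In combined, later sets win. conflicts
    is a sorted list of names that were given different values.
    """
    combined = {}
    conflicts = set()
    for params in parameter_sets:
        for name, value in params.items():
            if name in combined and combined[name] != value:
                conflicts.add(name)
            combined[name] = value
    return combined, sorted(conflicts)


# Hypothetical contents for two of the sets described in this post.
PX_HIGH_VOLUME = {"APT_CONFIG_FILE": "/opt/configs/large_16node.apt"}
PX_MANY_STAGES = {"APT_THIN_SCORE": "1"}

if __name__ == "__main__":
    combined, conflicts = combine_parameter_sets(PX_HIGH_VOLUME, PX_MANY_STAGES)
    print(combined)
    print("conflicts:", conflicts)
```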
You might also create a ParameterSet for a difficult type of source data file when default values don't work, eg. PX_MFRAME_DATA:
- APT_EBCDIC_VERSION: Certain operators, including the import and export operators, support the "ebcdic" property specifying that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION variable indicates the specific EBCDIC character set to use.
- APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL: When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.
SAS is another operator that has a lot of configurable environment variables because when you are reading or writing native SAS datasets or running a SAS transformation you are handing some of the control over to SAS - these environment variables configure this interaction:
- APT_HASH_TO_SASHASH: can output data hashed using sashash - the hash algorithm used by SAS.
- APT_SAS_ACCEPT_ERROR: When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for WebSphere DataStage to terminate the operator with an error.
- APT_NO_SAS_TRANSFORMS: WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.
You can group all known debug parameters into a single debug file to make it easier for support to find:
- APT_SAS_DEBUG: Set this to turn on debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log. Don't put this into your SAS Parameter Set as the support team might not be able to find it or know it exists.
- APT_SAS_DEBUG_IO: Set this to turn on input/output debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.
- APT_SAS_SCHEMASOURCE_DUMP: When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.
So a new developer who is handed a high volume job does not need to know anything about environment variables, they just need to add the right ParameterSet to the job. And if an experienced developer decides a new environment variable needs to be added to high volume jobs they just add it to the central ParameterSet and recompile all the jobs that use it. The "Where Used" function will help identify those jobs.
ParameterSets and environment variables make a powerful combination. ParameterSets can act as a layer that simplifies environment parameters and makes them easier to add to jobs.