DataStage Tutorial
- InfoSphere DataStage- Parallel Framework Standard Practices
- Information Server 8
- Architecture and Information Tiers
- IBM Information Management InfoSphere Services
- Center of Excellence for Data Integration (CEDI)
- About DataStage
- Client Components / DataStage Designer.
- DataStage Director.
- DataStage Manager.
- DataStage Administrator
- Server Components
- DataStage Features
- DataStage Manager Roles
- Types of jobs
- Transformer Stage
- Transformer Editor Components
- Transformer Stage Basic Concepts
- DATABASE Stages
- PROCESSING Stages
- AGGREGATOR Stage
- FOLDER Stage
- IPC Stage:
- LINK PARTITIONER Stage:
- LINK COLLECTOR Stage
- DataStage NLS
- JOB
- Built-In Stages – Server Jobs
- DATASTAGE
- COMPONENTS OF DATASTAGE
- SERVER COMPONENTS
- DATASTAGE PROJECTS
- DATASTAGE JOBS
- SPECIAL ENTITIES
- TYPES OF STAGES
- DATASTAGE NLS
- SETTING UP YOUR PROJECT
- STARTING THE DATASTAGE DESIGNER
- CREATING A JOB
- DEVELOPING A JOB
- What does a Config File in parallel extender consist of?
- Types of Parallel Processing in datastage
- How to handle date conversions in DataStage? Convert mm/dd/yyyy format to yyyy-dd-mm?
- What are the OConv() and Iconv() functions and where are they used?
- How to run a Shell Script within the scope of a Data stage job?
- What are the command line functions that import and export the DS jobs?
- How do you execute datastage job from command line prompt?
- What is the Usage of Containers? What are its types?
- What are Static Hash files and Dynamic Hash files?
- What are types of Hashed File?
- What is Hash file stage and what is it used for?
- What are Stage Variables, Derivations and Constants?
- Differentiate Primary Key and Partition Key?
- Orchestrate Vs Datastage Parallel Extender
- What is the flow of loading data into fact & dimensional tables?
- What is the default cache size? How do you change the cache size if needed?
- What are types of Hashed File?
- What is Modulus and Splitting in Dynamic Hashed File?
- What are stage variables, in dynamic hashed file?
- Types of views in data stage director?
- Types of parallel processing?
- Orchestrate Vs Data stage Parallel Extender?
- Importance of surrogate key in data warehousing?
- How to run a shell script within the scope of a data stage job?
- How to handle date conversions in data stage? Convert an mm/dd/yyyy format to yyyy-dd-mm?
- Functionality of Link Partitioner and Link Collector?
- How do you execute datastage job from command line prompt
- Types of Dimensional Modeling?
- Differentiate Primary Key and Partition Key?
- Differentiate Database data and Data warehouse data?
DataStage Interview Questions
- What are Sequencers?
- How did you connect with DB2 in your last project?
- What are Routines, where and how are they written, and have you written any?
- How did you handle an Aborted sequencer?
- Question: Read the String functions in DS
DataStage Jobs:
Data Stores:
For the purposes of this section, a data store is a physical piece of disk storage where data is held for a period of time. In DataStage terms, this can be either a table in a database structure or a file contained in a disk directory or catalog structure. Data held in a database structure is referred to as either a table or a view. In data warehousing, two additional subclasses of table might be used: dimension and fact. Data held in a file in a directory structure is classified according to its type, for example: Sequential File, Parallel Dataset, Lookup File Set, and so on.
The concepts of “source” and “target” can be applied in a couple of ways. Every job in a series of jobs could consider the data it gets in to be a source and the data it writes out as being a target. However, for the sake of this naming convention a source is only data that is extracted from an original system. A target is the data structures that are produced or loaded as the final result of a particular series of jobs. This is based on the purpose of the project: to move data from a source to a target.
Data stores used as temporary structures to land data between jobs, supporting restart and modularity, should use the same names in the originating job and any downstream jobs reading the structure.
Sequencer Object Naming:
In a job sequencer, links are actually messages. Precede sequencer link names with the class word msg_, followed by the type of message (for example, fail or unconditional), and then by the ClassName. The following list shows some examples:
- Reception Succeeded Message: msg_ok_Reception
- Reception Failed Message: msg_fail_Reception
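The pattern above can be sketched as a small helper. This is only an illustration of the naming rule, not part of DataStage; the function name `sequencer_link_name` and its parameters are assumptions made for this example.

```python
def sequencer_link_name(message_type: str, class_name: str) -> str:
    """Build a sequencer link name of the form msg_<type>_<ClassName>.

    Hypothetical helper: DataStage does not provide this function.
    """
    if not message_type or not class_name:
        raise ValueError("message type and class name are both required")
    return f"msg_{message_type}_{class_name}"


print(sequencer_link_name("ok", "Reception"))    # msg_ok_Reception
print(sequencer_link_name("fail", "Reception"))  # msg_fail_Reception
```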
Stage Names:
DataStage assigns default names to stages as they are dragged onto the Designer canvas. These names are based on the type of stage (Object) and a unique number, based on the order the object was added to the flow. In a job or job sequence, stage names must be unique.
Links:
In a DataStage job, links are objects that represent the flow of data from one stage to the next. In a job sequence, links represent the flow of a message from one activity or step to the next.
It is particularly important to establish a consistent naming convention for link names, instead of using the default DSLink# (where # is an assigned number). In the graphical Designer environment, stage editors identify links by name. Having a descriptive link name reduces the chance for errors (for example, during link ordering). Furthermore, when sharing data with external applications (for example, through job reporting), establishing standardized link names makes it easier to understand results and audit counts.
To differentiate link names from stage objects, and to identify links in captured metadata, the prefix lnk_ is used before the subject name of a link.
The following rules can be used to establish a link name:
- The link name should define the subject of the data that is being moved.
- For non-stream links, the link name should include the link type (reference, reject) to reinforce the visual cues of the Designer canvas:
  - Ref for reference links (Lookup)
  - Rej for reject links (such as Lookup, Merge, Transformer, Sequential File, and Database)
- The type of movement might optionally be part of the Class Word. As examples:
  - In for input
  - Out for output
  - Upd for updates
  - Ins for inserts
  - Del for deletes
  - Get for shared container inputs
  - Put for shared container output
As data is enriched through stages, the same name might be appropriate for multiple links. In this case, specify a unique link name in a particular job or job sequence by including a number. (The DataStage Designer does not require link names on stages to be unique.)
The following list provides sample link names:
- Input Transactions: lnk_Txn_In
- Reference Account Number Rejects: lnk_Account_Ref_Rej
- Customer File Rejects: lnk_Customer_Rej
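As a rough illustration, the link naming rules could be checked by a script run over a list of link names (for example, names taken from a job report). The function `check_link_name` and its messages are assumptions for this sketch; only the lnk_ prefix, the default DSLink# pattern, and the class words come from the text above.

```python
import re

# Class words taken from the rules above.
LINK_TYPES = {"Ref", "Rej"}
MOVEMENTS = {"In", "Out", "Upd", "Ins", "Del", "Get", "Put"}
DEFAULT_NAME = re.compile(r"DSLink\d+")


def check_link_name(name: str) -> list[str]:
    """Return a list of problems found in a link name (empty if none)."""
    if DEFAULT_NAME.fullmatch(name):
        return ["default DSLink# name was not replaced"]
    if not name.startswith("lnk_"):
        return ["missing lnk_ prefix"]
    parts = name[len("lnk_"):].split("_")
    # The first token after the prefix should be the subject of the data,
    # not a link type or movement class word.
    if not parts[0] or parts[0] in LINK_TYPES | MOVEMENTS:
        return ["missing subject after lnk_"]
    return []


for candidate in ["lnk_Txn_In", "lnk_Account_Ref_Rej", "DSLink5", "Txn_In"]:
    print(candidate, check_link_name(candidate))
```

A check like this only catches mechanical problems; whether the subject is descriptive still needs a reviewer.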
Shared Containers:
Shared containers have the same naming constraints as jobs in that the name can be long but cannot contain underscores, so word capitalization must be used for readability. Shared containers might be placed anywhere in the repository tree and consideration must be given to a meaningful directory hierarchy. When a shared container is used, a character code is automatically added to that instance of its use throughout the project. It is optional as to whether you decide to change this code to something meaningful.
To differentiate between parallel shared containers and server shared containers, the following class word naming is recommended:
- Psc = Parallel Shared Container
- Ssc = Server Edition Shared Container
Note: Use of server shared containers is discouraged in a parallel job.
Examples of Shared Container naming are as follows:
- AuditTrailPsc (original naming as seen in the Category Directory)
- AuditTrailPscC1 (an instance of use of the previously mentioned shared container)
- AuditTrailPscC2 (another instance of use of the same shared container)
In the aforementioned examples, the characters C1 and C2 are automatically applied to the Shared Container stage by DataStage Designer when it is dragged onto the design canvas.
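The shared container convention can be sketched as a parser that splits a name into its base name, class word, and instance code. This is a minimal sketch: `parse_container_name` is a hypothetical helper, and it assumes the instance code is always the letter C followed by digits, which the examples suggest but the text does not state.

```python
import re

CLASS_WORDS = {
    "Psc": "Parallel Shared Container",
    "Ssc": "Server Edition Shared Container",
}

# Base name (no underscores), class word, optional instance code.
# Assumption: instance codes look like C1, C2, ...
CONTAINER_NAME = re.compile(r"([A-Za-z][A-Za-z0-9]*?)(Psc|Ssc)(C\d+)?")


def parse_container_name(name: str):
    """Return the parts of a shared container name, or None if it does not fit."""
    match = CONTAINER_NAME.fullmatch(name)
    if match is None:
        return None
    base, class_word, instance = match.groups()
    return {"base": base, "kind": CLASS_WORDS[class_word], "instance": instance}


print(parse_container_name("AuditTrailPsc"))
print(parse_container_name("AuditTrailPscC1"))
print(parse_container_name("Audit_TrailPsc"))  # None: underscores are not allowed
```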
DataStage Folder Hierarchy:
The DataStage repository is organized in a folder hierarchy, allowing related objects to be grouped together. Folder names can be long, are alphanumeric, and can contain both spaces and underscores. Therefore, directory names are word capitalized and separated by either an underscore or a space.
Information Server 8 maintains the restriction that there can only be a single object of a certain type with a given name.
Object Creation
In Information Server 8, object creation is simplified. To create a new object, right-click the target parent folder, select New, and then select the option for the desired object.
Categorization by Functional Module
For a given application or functional module, all objects can be grouped in a single top-level folder, with sub-levels for separate object types, as in Figure 3-6 on page 38. Job names must be unique in a DataStage project, not only in a folder.
Categorization by Developer
In development projects, folders might be created for each developer as their personal sandbox. That is the place where they perform unit test activities on jobs they are developing.
It is the responsibility of each developer to delete unused or obsolete code. The development manager, who is assigned the DataStage Manager role, must ensure that projects are not inflated with unused objects (such as jobs, sequences, folders, and table definitions).
Again, object names must be unique in a given project for the given object type. Two developers cannot save a copy of the same job with the same name in their individual sandbox categories. A unique job name must be given.
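To illustrate the uniqueness rule, the following sketch scans an inventory of objects and reports names that appear in more than one folder for the same object type. The inventory format (tuples of folder, object type, and name), the folder names, and the function `find_name_clashes` are all assumptions for this example; it also assumes names are compared exactly as written.

```python
from collections import defaultdict


def find_name_clashes(objects):
    """Group folders by (object type, name) and keep only the clashes.

    `objects` is an iterable of (folder, object_type, name) tuples.
    """
    seen = defaultdict(list)
    for folder, object_type, name in objects:
        seen[(object_type, name)].append(folder)
    return {key: folders for key, folders in seen.items() if len(folders) > 1}


# Hypothetical inventory from two developer sandboxes.
inventory = [
    ("Sandbox/Alice", "job", "CodeBlockAggregationJob"),
    ("Sandbox/Bob", "job", "CodeBlockAggregationJob"),
    ("Sandbox/Bob", "job", "CustomerLoadJob"),
]
print(find_name_clashes(inventory))
```

A check of this kind might be useful before merging work from separate sandboxes into a shared folder structure.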
Table Definition Categories
Table definitions can be stored directly underneath Table Definitions. The data source type and data source name properties of a table definition do not determine the names of its parent subfolders.
When saving temporary TableDefs (usually created from output link definitions to assist with job creation), developers are prompted for the folder in the “Save Table Definition As” window. The user must pay attention to the folder location, as these objects are no longer stored in the Table Definition category by default.
Jobs and Job Sequences
Job names must begin with a letter and can contain letters, numbers, and underscores only. Because the name can be long, job and job sequence names must be descriptive and should use word capitalization to make them readable.
Jobs and job sequences are all held under the Category Directory Structure, of which the top level is the category Jobs.
A job is suffixed with the class word Job and a job sequence is suffixed with the class word Seq.
The following items are examples of job naming:
- CodeBlockAggregationJob
- CodeBlockProcessingSe
Jobs must be organized under category directories so that each directory contains a sequence job and all the jobs that are contained in that sequence.
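The job naming rules in this section (start with a letter; letters, numbers, and underscores only; a Job or Seq class word as suffix) can be expressed as a short check. The function `classify_job_name` and the name `LoadCustomerSeq` are hypothetical and used only for illustration.

```python
import re

# Must begin with a letter; letters, numbers, and underscores only.
JOB_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def classify_job_name(name: str) -> str:
    """Return 'job' or 'job sequence' for a valid name, else raise ValueError."""
    if not JOB_NAME.fullmatch(name):
        raise ValueError(f"{name!r} contains invalid characters or does not start with a letter")
    if name.endswith("Seq"):
        return "job sequence"
    if name.endswith("Job"):
        return "job"
    raise ValueError(f"{name!r} is missing the Job or Seq class word")


print(classify_job_name("CodeBlockAggregationJob"))  # job
print(classify_job_name("LoadCustomerSeq"))          # job sequence
```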
Naming Conventions by Object Type:
In this section we describe the object type naming conventions.
Projects
Each DataStage project is a standalone repository. It might have a one-to-one relationship with an organization's project of work. This factor can cause terminology issues, especially in teamwork where both business users and developers are involved.
The name of a DataStage project is limited to a maximum of 18 characters. The project name can contain alphanumeric characters and underscores.
Project names must be maintained in unison with source code control. As projects are promoted through source control, the name of the phase and the project name should reflect the version, in the following form:
<Phase>_<ProjectName>_<version>
Project phases
| Phase Name | Phase Description |
| Dev | Development |
| IT | Integration Test |
| UAT | User Acceptance Test |
| Prod | Production |
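As a sketch of how the <Phase>_<ProjectName>_<version> form interacts with the 18-character limit, the helper below composes a project name and rejects results that break the stated rules. The function `project_name` and the sample values "Billing" and "v2" are assumptions for this example; the phase names and the limits come from the text above.

```python
import re

PHASES = {
    "Dev": "Development",
    "IT": "Integration Test",
    "UAT": "User Acceptance Test",
    "Prod": "Production",
}
MAX_PROJECT_NAME_LENGTH = 18


def project_name(phase: str, project: str, version: str) -> str:
    """Compose <Phase>_<ProjectName>_<version> and validate the result."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase {phase!r}")
    name = f"{phase}_{project}_{version}"
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"{name!r} may contain only alphanumeric characters and underscores")
    if len(name) > MAX_PROJECT_NAME_LENGTH:
        raise ValueError(f"{name!r} exceeds {MAX_PROJECT_NAME_LENGTH} characters")
    return name


print(project_name("Dev", "Billing", "v2"))  # Dev_Billing_v2
```

Because the phase and version consume part of the 18 characters, the project portion of the name has to stay short.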
Documentation and Metadata Capture:
One of the major problems with any development effort, whatever tool you use, is maintaining documentation. Despite best intentions, and often due to time constraints, documentation is often left until later or is inadequately implemented. Establishing a standard method of documentation with examples and enforcing this as part of the acceptance criteria is strongly recommended. The use of meaningful naming standards (as outlined in this section) complements these efforts.
DataStage provides the ability to document during development with the use of meaningful naming standards (as outlined in this section). Establishing standards also eases the use of external tools and processes such as InfoSphere Metadata Workbench, which can provide impact analysis, as well as documentation and auditing.
Designer Object Layout:
The effective use of naming conventions means that objects need to be spaced appropriately on the DataStage Designer canvas. For stages with multiple links, expanding the icon border can significantly improve readability. This approach takes extra effort at first, so a pattern of work needs to be identified and adopted to help development. The Snap to Grid feature of Designer can improve development speed.
When development is more or less complete, attention must be given to the layout to enhance readability before it is handed over to versioning.
Where possible, consideration must be given to providing DataStage developers with higher resolution screens, as this gives them more monitor display real estate. This can make them more productive and makes their work easier to read.





