Thursday, August 13, 2015

SSIS Web Service Task


The Web Service task in SSIS can be used to call a web service operation to perform a needed action in your package. The results of most web services are in XML format. You can save these results into a variable or into a file. Either way, you can then parse the XML to find the node values you need. This can be done with the XML task in SSIS; I have a blog post on the XML task as well.
The web service I am using is a free demo service. It allows you to enter a zip code and returns the city and state associated with that zip code. The first thing you need in the package is an HTTP connection. The Server URL for this connection is http://www.webservicex.net/uszip.asmx?op=GetInfoByZIP. The rest of the settings for the connection are left at their defaults.
[Screenshot: HTTP Connection Manager with the Server URL set]
Then drag a Web Service Task into the control flow and set its connection to the HTTP connection you just created. You will also need the WSDL file for the web service. It can be downloaded from the website where the web service is hosted and must be saved locally. In the Web Service task, set the WSDLFile property to the location where you saved it.
[Screenshot: Web Service Task General page with the connection and WSDL file set]
Click on the Input node in the left pane and set the input properties as shown below. These are drop-down menus populated automatically from the WSDL file. Create a package variable with a data type of String to hold the zip code, and map it in the fields as shown.
[Screenshot: Web Service Task Input page with the service, method, and zip code variable mapped]
Under the Output node, set the OutputType to Variable and create a Results variable with a data type of String.
[Screenshot: Web Service Task Output page with OutputType set to Variable]
If you enter the zip code 32065, you will receive back an XML result like the one below.
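From memory of this demo service, the response looks roughly like the following. The CITY and STATE elements are the ones used later; the other fields are recalled from the service and may differ:

    <NewDataSet>
      <Table>
        <CITY>Orange Park</CITY>
        <STATE>FL</STATE>
        <ZIP>32065</ZIP>
        <AREA_CODE>904</AREA_CODE>
        <TIME_ZONE>E</TIME_ZONE>
      </Table>
    </NewDataSet>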
This shows us the proper city and state associated with the zip code we entered. Now we can parse the XML with an XML task and use this data. I show how to do that in the next post.

SSIS XML Task


The XML Task in SSIS allows you to parse an XML document and read its nodes. In the previous post I showed how to use a Web Service task to get the city and state for a zip code.
The results we got back from the web service were in XML format, and we saved them into a variable named Results. This image shows the value in the variable.
[Screenshot: Results variable containing the returned XML]
We need to get the city and the state out of this XML and save each into its own variable. Create an XML task in the control flow of the package and set the OperationType to XPATH. The source is the Results variable. The destination is the City variable. The second operand is the node you want read, "//CITY" in this case. Finally, set the XPathOperation to Values because you want the value of the CITY node.
[Screenshot: XML Task settings for retrieving the CITY value]
The XML Task that retrieves the State value will be identical except for the Second Operand and the Destination variable.
[Screenshot: XML Task settings for retrieving the STATE value]
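If you want to sanity-check the XPath expressions outside SSIS, a quick T-SQL sketch like this does the same extraction with the XML value() method (the city and state values here are illustrative):

    DECLARE @Results xml = N'<NewDataSet><Table>
        <CITY>Orange Park</CITY><STATE>FL</STATE></Table></NewDataSet>';

    -- "//CITY" with the Values operation returns the text of the CITY node
    SELECT @Results.value('(//CITY)[1]',  'varchar(50)') AS City,   -- Orange Park
           @Results.value('(//STATE)[1]', 'varchar(2)')  AS State;  -- FL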
When everything is complete, the Web Service task runs before the two XML tasks. I placed a breakpoint on the last XML task, and the Watch window below shows the variables: Results contains the XML, and the City and State variables contain the values from their respective nodes.
[Screenshot: Watch window showing the Results, City, and State variable values]

SSIS: Best Practices for Better Performance

SQL Server Integration Services (SSIS) 10 Quick Best Practices
Here are 10 SSIS best practices that are good to follow during any SSIS package development.
- The most desired feature in SSIS package development is re-usability. In other words, these are standard packages that can be re-used during different ETL component development. In SSIS this is easily achieved using the template feature: SSIS template packages are re-usable packages that you can use in any SSIS project any number of times. To learn how to configure this, please see http://support.microsoft.com/kb/908018

- Avoid using a dot (.) in your package names. The dot naming convention is easily confused with SQL Server object naming and should be avoided; a good approach is to use an underscore (_) instead. Also make sure package names do not exceed 100 characters. When deploying a package to SQL Server storage, any characters beyond 100 are silently truncated from the package name, which can cause the package to fail at runtime, especially when you use Execute Package Tasks.

- The flow of data from upstream to downstream in a package is a memory-intensive task, so at most steps and components we have to check carefully that no unnecessary columns are passed downstream. This avoids extra execution overhead and improves the overall performance of package execution.

- While configuring any OLE DB connection manager as a source, avoid using 'Table or view' as the data access mode. This is equivalent to 'SELECT * FROM <TABLE_NAME>', and as most of us know, SELECT * is our enemy: it takes all the columns into account, including those that are not required. Always try to use the 'SQL command' data access mode and include only the required column names in your SELECT statement. In this way you block unnecessary columns from being passed downstream, as in the sketch below.
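For example, a 'SQL command' along these lines (table and column names are hypothetical) limits the buffers to what the package actually needs:

    -- 'SQL command' access mode: select only the columns the data flow uses
    SELECT  CustomerID,
            CustomerName,
            ModifiedDate
    FROM    dbo.Customer
    WHERE   ModifiedDate >= '2015-01-01';   -- filter at the source as well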

- In your Data Flow tasks, use the Flat File connection manager carefully. Creating a Flat File connection manager with the default settings uses the string [DT_STR] data type for all columns. This is not always the right choice, because your source may contain numeric, integer, or Boolean columns; passing them downstream as strings takes unnecessary memory and may cause errors at later stages of package execution.

- Sorting data is a time-consuming operation. In SSIS you can sort data coming from upstream using the Sort transformation, but it is memory-intensive and can degrade overall package execution performance. As a best practice, wherever the data is coming from SQL Server database tables, perform the sorting at the database level, within the query, where the engine's sorting is far more refined. This often improves overall package execution performance; see the sketch below.
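For instance, sorting in the source query (hypothetical table) avoids the Sort transformation entirely:

    -- Sort at the database level instead of using the Sort transformation
    SELECT  CustomerID, CustomerName
    FROM    dbo.Customer
    ORDER BY CustomerID;

If a downstream component such as Merge Join needs sorted input, open the OLE DB source's Advanced Editor, set IsSorted to True on the output, and set SortKeyPosition on the sort column so SSIS knows the data is already ordered.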

- During SSIS package development you often have to share a package with other team members or deploy the same package to dev, UAT, or production systems. One thing a developer has to make sure of is to use the correct package protection level. If you go with the default, 'EncryptSensitiveWithUserKey', the package might not execute as expected in other environments, because it was encrypted with the user's personal key. To make package execution smooth across environments, first understand the package protection level property behaviour; please see http://msdn2.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.dtsprotectionlevel.aspx. In general, to avoid most package deployment errors from one system to another, set the package protection level to 'DontSaveSensitive'.

- It's a best practice to make use of Sequence containers in SSIS packages to group related components at the control flow level. This offers a rich set of facilities:
  - Provides a scope for variables that a group of related tasks and containers can use
  - Provides a way to manage the properties of multiple tasks by setting them at the Sequence container level
  - Provides a way to set the transaction isolation level at the Sequence container level
For more information on Sequence containers, please see http://msdn2.microsoft.com/en-us/library/ms139855.aspx.
- If you are designing an ETL solution for a small, medium, or large enterprise, it's always good to be able to restart failed packages from the point of failure. SSIS has an out-of-the-box feature called checkpoints to support restarting failed packages from the point of failure; you have to configure it at the package level. For more information, please see http://msdn2.microsoft.com/en-us/library/ms140226.aspx.


- The Execute SQL Task is our best friend in SSIS; we can use it to run a single SQL statement or multiple statements at a time. The beauty of this component is that it can return results in different ways, e.g. single row, full result set, or XML, and it can use different connection types: OLE DB, ODBC, ADO, ADO.NET, SQL Mobile, etc. I most often use this component with a Foreach Loop container, defining the iteration loop on the basis of the result set returned by the Execute SQL Task. For more information, please see http://msdn2.microsoft.com/en-us/library/ms141003.aspx & http://www.sqlis.com/58.aspx.




This article covers performance improvement techniques and performance constraint scenarios that are within the developer's scope only.

Choice of Transformations
In the real world we have to do several transformations before the data is actually loaded. Transformations use buffer memory, which in turn affects performance, so it is very important to understand which transformations influence performance and how.
Transformations can be categorized into fully blocking, semi-blocking, and non-blocking transformations.
Fully blocking transformations: block the entire dataset to perform the transformation.
Semi-blocking transformations: block groups of rows to perform the transformation.
Non-blocking transformations: do not block the dataset.
As a general rule, we should try to reduce the number of blocking and semi-blocking transformations.

Non-blocking transformations: Audit, Character Map, Conditional Split, Copy Column, Data Conversion, Derived Column, Lookup, Multicast, Percent Sampling, Row Count, Script Component, Export Column, Import Column, Slowly Changing Dimension, OLE DB Command

Semi-blocking transformations: Data Mining Query, Merge, Merge Join, Pivot, Unpivot, Term Lookup, Union All

Blocking transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Row Sampling, Sort, Term Extraction
Extra Columns
The most common mistake made when starting to develop an SSIS package is to choose all the columns even when some of them are not required. This might not sound like a big deal, but consider a scenario where you need two fields from a source table that has a hundred-odd fields: the dataset uses far more buffer space than actually required.
Ensure you always select only the columns that are required. By default, SSIS shows warning messages for column names that are not used.
Configuring Lookups
The Lookup is one of the most common transformations used in SSIS.
The default lookup query for the Lookup Transform is
SELECT * FROM …
The Lookup lets you select a table or a SQL query as the reference. It is always advisable to use a SQL query and choose only the needed columns.
Enabling full caching in lookups enhances the performance of the transformation; however, this works reliably only if there are no duplicate records. Another common issue occurs with blank spaces in the lookup fields, where the lookup returns no matching data. It's better to trim the fields to get matching records in full cache mode, as in the sketch below.
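A minimal sketch of such a lookup query (table and column names are hypothetical); trimming the join column in the query helps full-cache matching when the reference data has trailing spaces:

    -- Lookup query: only the join key and the columns you need back
    SELECT  RTRIM(CustomerCode) AS CustomerCode,  -- trim to match in full cache mode
            CustomerKey
    FROM    dbo.DimCustomer;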
Using SCD
The Slowly Changing Dimension transformation is normally used to insert, update, or delete records in a table based on the source data. An alternative approach is to do this in a SQL query, if both the source and destination are on the same server or reachable through linked servers.
The MERGE statement, available from SQL Server 2008 onwards, lets you do just that. You could also write a join query to find the matching records that need to be changed. A minimal sketch follows.
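This MERGE sketch (table and column names are hypothetical) covers the insert, update, and delete cases an SCD-style load needs:

    MERGE dbo.DimCustomer AS tgt
    USING dbo.StgCustomer AS src
        ON tgt.CustomerCode = src.CustomerCode
    WHEN MATCHED AND tgt.CustomerName <> src.CustomerName   -- changed rows
        THEN UPDATE SET tgt.CustomerName = src.CustomerName
    WHEN NOT MATCHED BY TARGET                               -- new rows
        THEN INSERT (CustomerCode, CustomerName)
             VALUES (src.CustomerCode, src.CustomerName)
    WHEN NOT MATCHED BY SOURCE                               -- rows gone from source
        THEN DELETE;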
Configuring Source component
Some transformations can be avoided entirely if the work is done in the source component. Joining two tables on the same server, filtering data, sorting data, or grouping it can all be performed with a simple SQL query in the source component.
Configuring Destination component
Fast load vs. normal load
The difference is simple: the former is a bulk insert while the latter is a row-by-row insert (use SQL Profiler to see the difference). If you are sure about the data being processed and want to considerably reduce the time taken for a huge insert, fast load is the one you need. There is a drawback, however: you can't divert the specific error rows, because when there is an error the entire batch fails. A workaround is to redirect the failed batch to another destination that does a row-by-row insert into the same table, and redirect the error rows from there.
OLE DB destination vs. SQL Server destination
If the package is executed on the same machine as the database, using the SQL Server destination adapter improves performance considerably.
SSIS Properties
The buffers used by a Data Flow task can be tuned with the DefaultBufferSize and DefaultBufferMaxRows properties. Increasing them increases the amount of data carried in each buffer through the data flow; however, setting them too high can cause buffers to spill to disk, which defeats the purpose.
Parallel execution of tasks can be increased with the MaxConcurrentExecutables property. Together with the EngineThreads property, which controls the number of worker threads, you need to work out the right degree of parallelism.
Configuring Flat File Source
When using a flat file source, it is important not to do unnecessary conversions of the columns. By default all the data is read as strings, so convert only the columns that require conversion to another type (including nvarchar to varchar).
FastParse indicates whether a column uses the quicker, but locale-insensitive, fast parsing routines that Integration Services provides.
Setting the Fast Parse option on the flat file source improves performance by roughly 7 to 20% on large files.
Usage of Indexes
I can't conclude the performance chapter without mentioning the usage of indexes. Indexes can be a huge constraint when inserting high volumes of data into tables with several indexes. A workaround, sketched below, is to drop the indexes before the load and recreate them afterwards.
Also ensure that any unused indexes are removed from the table.
On the other hand, indexes are useful on the source table, so put some thought into creating and deleting them.
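A sketch of the drop-and-recreate pattern around a large load (the index and table names are hypothetical):

    DROP INDEX IX_FactSales_CustomerKey ON dbo.FactSales;

    -- ... run the SSIS data load here ...

    CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
        ON dbo.FactSales (CustomerKey);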




Performance Improvement in SSIS

(E) EXTRACT IMPROVEMENT

1) If there is a Flat File source or a Data Conversion transformation, set "Fast Parse" to "True".
    - It is available only in the Flat File source and the Data Conversion transformation
    - It is specified at the column level
    - The default value is False
    - When set to True, it skips some pre-execute, locale-sensitive parsing work and assumes all your data is fine

    Steps

  1. Right-click the Flat File source or Data Conversion transformation, and then click Show Advanced Editor.
  2. In the Advanced Editor dialog box, click the Input and Output Properties tab.
  3. In the Inputs and Outputs pane, click the column for which you want to enable fast parse.
  4. In the Properties window, expand the Custom Properties node, and then set the FastParse property to True.
  5. Click OK.

2) Set the packet size to 32767 on the connection manager.
    - This bumps the packet size up from the default of 4K
    - It needs the network admin to enable jumbo frames


3) In an OLE DB source, use a T-SQL query instead of a table as the direct input, as in the sketch below.
    - This allows you to choose specific columns instead of pulling all the columns
    - You can specify NOLOCK, which avoids locking the table
    - You can use sorts, GROUP BY, joins, and formulated columns instead of using separate Sort, Merge Join, Derived Column, and Aggregate transformations
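A source query along these lines (all names hypothetical) folds the join, aggregation, sort, and hint into one statement:

    SELECT  o.CustomerID,
            c.Region,
            SUM(o.Amount) AS TotalAmount        -- replaces an Aggregate transformation
    FROM    dbo.Orders    AS o WITH (NOLOCK)    -- avoids locking the source table
    JOIN    dbo.Customer  AS c WITH (NOLOCK)
            ON c.CustomerID = o.CustomerID      -- replaces a Merge Join
    GROUP BY o.CustomerID, c.Region
    ORDER BY o.CustomerID;                      -- replaces a Sort transformation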

4) In the Cache connection manager, try to use a cache file instead of memory.

5) If you are using the same OLE DB source connection in multiple places, set the "RetainSameConnection" property to "True".

     - This allows the engine to use the same connection every time
     - Default value: False, which means connect, get data, and close the connection every time
     - Setting it to True means this is done only once


6) Divide the source into chunks instead of having a single master pool.

(T) TRANSFORM IMPROVEMENT

1) Choose transformations based on their usage and buffer behavior

Behind the scenes, the data flow engine uses a buffer-oriented architecture to efficiently load and manipulate datasets in memory.
  • Row transformations
    - They manipulate data or create new fields using the data available in that row
    - They may create new columns but not new rows
    - Each output row has a 1:1 relationship with an input row
    - Also known as synchronous transformations
    - They reuse the existing buffer rather than requiring a new buffer
    - Examples: Derived Column, Data Conversion, Multicast, and Lookup

  • Partially blocking transformations
    - They are often used to combine datasets from multiple data inputs
    - As a result, their output may have the same, a greater, or a smaller number of records than the total number of input records
    - Also known as asynchronous transformations
    - The output of the transformation is copied into a new buffer, and a new thread may be introduced into the data flow
    - Examples: Merge, Merge Join, and Union All

  • Blocking transformations
    - They must read and process all input records before creating any output records
    - They perform the most work and can have the greatest impact on available resources of the three categories
    - Also known as asynchronous transformations
    - As with partially blocking transformations, the output is copied into a new buffer and a new thread may be introduced into the data flow
    - Examples: Aggregate and Sort


2) Use MERGE statement when you have to do INSERT-UPDATE-DELETE from source to target instead of using multiple transformations

3) Choice of cache type inside the Lookup transformation
     - Full cache: for small datasets
     - No cache: for volatile datasets
     - Partial cache: for large datasets

4) Sort, Merge Join, Union All, Pivot, Aggregate, SCD, and Data Conversion can easily be replaced by plain T-SQL.

      - There is much more control over all the objects
      - The T-SQL operation will be much faster than the SSIS transformations because the data flow buffers aren't used


5) Make data types as narrow as possible so that they allocate less memory, as in the sketch below.
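For example, casting in the source query (hypothetical columns) keeps the SSIS column metadata narrow from the start:

    -- Narrow the types at the source so SSIS buffers reserve less space per row
    SELECT  CAST(CustomerName AS varchar(50)) AS CustomerName,  -- instead of nvarchar(max)
            CAST(Quantity     AS smallint)    AS Quantity       -- instead of int
    FROM    dbo.StgOrders;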



(L) LOAD IMPROVEMENT

1) Try to execute the package on the destination server, rather than the source server.

      - LOAD is a more expensive operation than EXTRACT
      - So execute the package on the same server as the destination

2) Make a smart choice between dropping and keeping indexes.

      - It is not necessary to always keep or always drop indexes before you load.
      - If there is a clustered index, don't drop it, because the data is sorted by this key, and dropping and rebuilding the clustered index will take even more time.
      - If there is a single non-clustered index and you expect more than 100% new data, dropping and re-creating the index will help.
      - If there are multiple non-clustered indexes, probably leave them as they are.

   These are not hard rules, though; trial and error will give you the best result.

3) If there is a very heavy load on the destination, partitioning the table will probably help.

4) Set proper values for "Rows per batch" and "Maximum insert commit size".


Rows per batch: how many rows you want to send per batch when inserting the data.
Maximum insert commit size: how many rows you want to commit in one shot.
      - If the value is left at 2,147,483,647 (the default), all of the rows are committed in one single transaction.
      - If you really have that many rows to load, you should define a sensible commit size. Say you define 100,000: then 100,000 rows are committed in one shot. A huge DML operation in one single transaction degrades performance.
      - If it is 0, the package might stop responding if the same table is being used by some other process.

SSIS Tips and Tricks

SQL: Calculating Ideal Default Buffer Max Rows and Default Buffer Size in SSIS



Calculate the table size using:
Exec sp_spaceused 'tablename'
This gives you the table size in KB and the number of records in the table.





If you are including error-output records and derived columns, you need to take those into account in the calculation: work out each derived column's data type and size, and include it in the row size.
Now the size of each row in bytes = (table data size in KB * 1024) / number of rows.
The default buffer size is 10,485,760 bytes (approx. 10 MB), and DefaultBufferSize can be raised to a maximum of 104,857,600 bytes (100 MB).
Number of rows for a 10 MB buffer = 10,485,760 / size of each row in bytes.
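The whole calculation in one hedged T-SQL sketch; the row count and data size here are hypothetical values you would read off the sp_spaceused output for your own table:

    -- Estimate bytes per row and a DefaultBufferMaxRows value for a 10 MB buffer
    DECLARE @rows   bigint = 1000000,  -- "rows" from: Exec sp_spaceused 'dbo.FactSales'
            @dataKB bigint = 250000;   -- "data" (KB) from the same sp_spaceused output

    DECLARE @bytesPerRow bigint = (@dataKB * 1024) / @rows;

    SELECT  @bytesPerRow             AS BytesPerRow,
            10485760 / @bytesPerRow  AS RowsPer10MBBuffer;  -- candidate DefaultBufferMaxRows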