Microsoft Technologies: Change Data Capture in SQL Server 2008

CDC is one of the new data tracking and capturing features of SQL Server 2008. It only tracks changes in user-created tables. Because captured data is then stored in relational tables, it can be easily accessed and retrieved subsequently, using regular T-SQL.

When you apply Change Data Capture features on a database table, a mirror of the tracked table is created with the same column structure of the original table, but with additional columns that include the metadata used to summarize the nature of the change in the database table row. The SQL Server DBA can then easily monitor the activity for the logged table using these new audit tables .

CDC first has to be enabled for the database. Because CDC is a table-level feature, it then has to be enabled for each table to be tracked. you will see that a schema with the name ‘cdc’ has now been created.

Some System Tables will have been created within the database as part of the cdc schema.

The table which have been created are listed here.

· cdc.captured_columns – This table returns result for list of captured column.

· cdc.change_tables – This table returns list of all the tables which are enabled for capture.

· cdc.ddl_history – This table contains history of all the DDL changes since capture data enabled.

· cdc.index_columns – This table contains indexes associated with change table.

· cdc.lsn_time_mapping – This table maps LSN number (for which we will learn later) and time.

Before enabling CDC at the table level, make sure that you have enabled SQL Server Agent. When CDC is enabled on a table, it creates two CDC-related jobs that are specific to the database, and executed using SQL Server Agent. Without SQL Server Agent enabled, these jobs will not execute.

Additionally, it is very important to understand the role of the required parameter @role_name. If there is any restriction of how data should be extracted from database, this option is used to specify any role which is following restrictions and gating access to data to this option if there is one. If you do not specify any role and, instead, pass a NULL value, data access to this changed table will not be tracked and will be available to access by everybody.

The Stored Procedure sys.sp_cdc_enable_table enables CDC. There are several options available with this SP but we will only mention the required options for this SP. CDC is very powerful and versatile tool. By understanding the Stored Procedure sys.sp_cdc_enable_table you will gain the true potential of the CDC feature. One more thing to notice is that when these jobs are created they are automatically enabled as well

As you will see there are five additional columnsto the mirrored original table

· __$start_lsn

· __$end_lsn

· __$seqval

· __$operation

· __$update_mask

There are two values which are very important to us is __$operation and __$update_mask.

Column _$operation contains value which corresponds to DML Operations. Following is quick list of value and its corresponding meaning.

· Delete Statement = 1

· Insert Statement = 2

· Value before Update Statement = 3

· Value after Update Statement = 4

The column _$update_mask shows, via a bitmap, which columns were updated in the DML operation that was specified by _$operation. If this was a DELETE or INSERT operation, all columns are updated and so the mask contains value which has all 1’s in it. This mask is contains value which is formed with Bit values.

stored procedure sys.sp_cdc_enable_table has several parameters

The parameter @capture_instance determines the name of the capture instance for this table. The name itself has no impact on performance. However, this parameter can be used to create a second capture instance for the same table. The feature of having a second capture instance is only intended for schema upgrade scenarios where a change to the schema of the source table is performed. The old capture instance should be disabled as soon as it is no longer needed, because having two capture instances active on one table means that twice as much change data needs to be written. This can significantly impact performance.

When you have a schema change on your source table, you should do the following:

1. Change the schema of your source table.

2. Create a new capture instance. That new capture instance is created with the new (changed) schema of the source table.

3. Wait for a change to occur, and then read the minimum LSN from the new capture instance (using sys.fn_cdc_get_min_lsn(new capture instance name)).

4. Read and process all changes from the old capture instance up to but not including the first LSN from the new capture instance.

5. Disable the old capture instance.

6. Update all references to the old capture instance name with the new name.

7. Continue reading from the new capture instance.

The parameter @captured_column_listdetermines which columns of the source table are included in the change table and captured by change data capture. If this parameter is not specified or NULL, all columns of the source table are included in the change table. We found that the number and size of captured columns has a significant impact on change data capture performance and required disk space. Change data capture performance is generally better when the number of captured columns is smaller because the amount of data that needs to be written to the change tables is smaller.

The parameter @supports_net_changes determines whether net change queries to the change table (using cdc.fn_cdc_get_net_changes_<capture_instance>) are possible. Net change queries result in only one change row for each source row that was changed, independent of the number of changes. For instance, if stock market prices are tracked per stock and the price column is updated several times a day, an all changes query would return every change of every stock price in the queried LSN interval. A net change query would result in only one change row per stock with the final price in the data.

When @supports_net_changes is set to 1, an additional nonclustered index is created on the change table and the net changes query function is created. Because this index needs to be maintained, we found that enabling net changes can have negative impact on change data capture performance.

The parameter @filegroup_name determines the name of the filegroup where the change tables are created. If this parameter is not specified, the default filegroup will be used. Because in most cases the default filegroup is PRIMARY and it is a best practice to keep the PRIMARY filegroup small, a filegroup name should always be specified.

If you are using AdventureWorks database, it creates the jobs with following names.

1. cdc.AdventureWorks_capture – When this job is executed it runs the system stored proceduresys.sp_MScdc_capture_job. The procedure sys.sp_cdc_scan is called internally bysys.sp_MScdc_capture_job. This procedure cannot be executed explicitly when a change data capture log scan operation is already active or when the database is enabled for transactional replication. This system SP enables SQL Server Agent, which in facts enable Change Data Capture feature.

2. cdc.AdventureWorks_cleanup – When this job is executed it runs the system stored proceduresys.sp_MScdc_cleanup_job. This system SP cleans up database changes tables.

The capture job created in SQL Server Agent for change data capture (typically named cdc.database name_capture) contains only a call to the procedure sys.sp_MScdc_capture_job without parameters. This procedure determines the scan job parameters and calls sys.sp_cdc_scan with these parameters. The procedure sys.sp_cdc_scan does the actual work by scanning the log (using sys.sp_replcmds) and inserting the data that needs to be captured into the change tables.

There are four parameters in sys.sp_cdc_scan that determine the behavior of the capture job.

The first parameter is continuous (default value 1). It determines whether the capture job runs continuously (value 1) or exits after one scan phase (one shot mode, value 0). One shot mode is recommended for testing only, not for production use. The main reason for this is that the log records stay active until change data capture has processed them. So, the log will keep growing when the scan job is not running.

The other three parameters determine how often and how many transactions are read from the log and inserted to the change tables.

The parameter maxtrans (default value 500) determines how many transactions are read from the log and written to the change tables. This write is done in one transaction. The parameter maxscans (the default value is 10) determines how many of these scan cycles are attempted before the job is ended (continuous = 0) or before pausing for an interval (continuous=1). The length of this pause interval is set by the parameter pollinginterval (in seconds, with a default value of 5 seconds). WAITFOR is executed when a scan cycle drains the log completely or when maxscans scan cycles are completed.

The scan job parameters can be changed by using sys.sp_cdc_change_job. Because these parameters are only read during the initialization of the capture job, the capture job needs to be stopped using EXEC sys.sp_cdc_stop_job @job_type = 'capture' and subsequently restarted using EXEC sys.sp_cdc_start_job @job_type = 'capture' so the changed parameters are applied.

CDC (Change Data Capture)

OVERVIEW

PRE-REQUISITE

Works only in Enterprise, Developer and Evaluation Edition
SQL Agent Service should be up & running

PRE-CONDITION

There should not be any schema with a name "cdc"
There should not be any column in a table called __$start_lsn, __$end_lsn, __$seqval, __$operation, and __$update_mask

BEST PRACTICE

all cdc changes should be tracked in a separate filegroup rather than using Primary
configure only required tables and columns

ENABLE CDC ON DATABASE

EXEC sys.sp_cdc_enable_db

Go

SELECT [name], database_id, is_cdc_enabled FROM sys.databases

This will create some columns under "cdc" schema which will be introduced.

ENABLE CDC ON TABLE

EXEC sys.sp_cdc_enable_table

@source_schema = 'dbo', -- Schema Name

@source_name = 'CDCCheck' , -- Table Name

@role_name = NULL,

@captured_column_list = 'a,b,c', -- Comma separate column list, if not given all columns will be considered

@filegroup_name = 'CDCFileGroup'

A new table will be created under CDC schema called "CDC.dbo_CDCCheck_CT". Apart from columns specified above, it will have 5 extra columns

__$Start_lsn - Log Sequence Number

__$end_lsn - Log Sequence Number

__$sequal

__$operation

1 - Insert
2 - Delete
3 - Before Update
4 - After update

__$update_mask

if the value is 0x12 means 0b10010 in binary means 2nd and 5th column has been updated
if the value is 0x1F means 0b11111 in binary means all 5 columns have been updated

DISABLE CDC ON TABLE

EXECUTE sys.sp_cdc_disable_table

@source_schema = N'HumanResources',

@source_name = N'Shift'

ADVANTAGES OF CDC

CDC talks with transaction log, pulls data from there and put them in cdc table. So it gives best performance in comparison with trigger approach
Low development cost, easy maintenance, clean approach of tracking audit.

LIMITATION/PROBLEM WITH CDC

TRUNCATE won't be allowed if CDC is enabled on table

If a new column gets added or existing column gets removed, it does not reflect CDC table. Alternate solution is, cdc should be disabled and enabled.

NVARCHAR(MAX), VARBINARY(MAX) columns will be part of each update operation even though their value is untouched. Ex. In a table, there are 4 columns. Out of this one column in NVARCHAR(MAX). Assume that 1 int column value has been changed but NVARCHAR(MAX) is untouched, still in CDC table, it will be tracked as the previous value was NULL and new value is existing value.

Microsoft Technologies

Pages

Monday, July 29, 2013

Change Data Capture in SQL Server 2008

CDC (Change Data Capture)

About Me