Off-late we had requirement to test the performance of our ETL. ETL as a process which consist of below
E- Extraction
T- Transformation
L-Loading
The data is acquired by various source systems(aka UpStream) can be relational database , flat files etc. the data is then transformed by various business logic and then loaded on Marts. We had to acquire data from SQL database, the acquisition was done by tool developed by MS , Replication. The Replication performance was good enough with default setting for replication. We calculated the latency between the publisher to Distributor and Distributor to Subscriber and the latency was less than 3-5 seconds. So having the good performance we focused on performance of Transformation and Loading part ; we had the SSIS framework and were using the control flow and Data flow task to load and transform the data .
The Loading and Transformation performance can be divided into following three part
1. Full/Initial run
2. No Data change in Upstream and perform the run
3. Populate the delta record in Upstream and perform the run.
The performance of the initial run was less than 2 hours for 25GB of data in various facts and dimension populated successfully. We than re-ran the job without any modification in source data and the ETL took 1 hours and 15 minutes ; since there was no data change i think this time was just to check the records against the last checkpoint where data was processed successfully and /or re-population the data by doing truncate of existing records. The last part was to populate the data in source systems and do the run again and capture the performance . The data loading was required to be done in key tables based on business requirement and the metrics collected. The data loading was done with separate utility in source system and we found out to load all the data successfully into source we might have to wait for 3 days. We could not wait for that time we verified the data loaded which was happened and took snapshot of the source DB and ran out the ETL job and captured the benchmarking. Snapshot in SQL DB is process which captures the DB details in time and later we can compare the performance benchmarking with equivalent data present in snapshot DB.Later we will talk about the ETL performance reports . Thanks all and happy ETL perf testing :-)
E- Extraction
T- Transformation
L-Loading
The data is acquired by various source systems(aka UpStream) can be relational database , flat files etc. the data is then transformed by various business logic and then loaded on Marts. We had to acquire data from SQL database, the acquisition was done by tool developed by MS , Replication. The Replication performance was good enough with default setting for replication. We calculated the latency between the publisher to Distributor and Distributor to Subscriber and the latency was less than 3-5 seconds. So having the good performance we focused on performance of Transformation and Loading part ; we had the SSIS framework and were using the control flow and Data flow task to load and transform the data .
The Loading and Transformation performance can be divided into following three part
1. Full/Initial run
2. No Data change in Upstream and perform the run
3. Populate the delta record in Upstream and perform the run.
The performance of the initial run was less than 2 hours for 25GB of data in various facts and dimension populated successfully. We than re-ran the job without any modification in source data and the ETL took 1 hours and 15 minutes ; since there was no data change i think this time was just to check the records against the last checkpoint where data was processed successfully and /or re-population the data by doing truncate of existing records. The last part was to populate the data in source systems and do the run again and capture the performance . The data loading was required to be done in key tables based on business requirement and the metrics collected. The data loading was done with separate utility in source system and we found out to load all the data successfully into source we might have to wait for 3 days. We could not wait for that time we verified the data loaded which was happened and took snapshot of the source DB and ran out the ETL job and captured the benchmarking. Snapshot in SQL DB is process which captures the DB details in time and later we can compare the performance benchmarking with equivalent data present in snapshot DB.Later we will talk about the ETL performance reports . Thanks all and happy ETL perf testing :-)
No comments:
Post a Comment