data warehouse - Aggregate Transformation vs Sort (remove Duplicate) in SSIS -
i'm trying populate dimension tables on regular basis , i've thought of 2 ways of getting distinct values dimension:
- using aggregate transformation , using "group by" operation.
- using sort transformation while removing duplicates.
i'm not sure 1 better (more efficient), or 1 adopted more in industry.
i tried perform tests using dummy data, can't quite solid answer.
p.s. using select distinct
source not option here.
my first choice correct in source query if possible. realise isn't option, sake of completeness future readers: first check whether had problem in source query creating duplicates. whenever distinct
seems necessary, first see whether there's problem query needs resolving.
my second choice distinct
- if possible - because 1 of cases quicker resolve in sql in ssis; realise that's not option you.
from point, you're getting situation might need try out remaining options. aside using aggregate or sort in ssis, dump results staging table, , have separate data flow use distinct
in source query. aggregate , sort both blocking transactions in ssis using staging table might end being faster - fastest depend on number of factors including nature of data, , nature of infrastructure. might want keep in mind else running in parallel if use ssis options, can memory-hungry.
if data (or can be) sorted in source or source query, there's clever idea in link below creating "semi-blocking" versions of aggregate , sort using script tasks:
Comments
Post a Comment