In my previous blog post Finding duplicates using PowerBI I have explained how PowerBI can be used to find duplicates. It kept me thinking what are other effective feasible ways to find duplicates other than PowerBI. One of my customer was not using PowerBI and they had a huge dataset of contacts approximately 20 million, which needs to be cleansed. So, I started exploring
Solution 1 – Using Excel PowerQuery and Fuzzy matching
Using Excel’s fuzzy matching add-in we can easily find duplicates. As Microsoft has made the CDS connection available using Power Query (check this link), it has become a breeze to work on Dynamics 365 data. Excel is great with smaller datasets but with the above described scenario it may not be a best fit, and more over there will be lot of manual effort involved. Even automation is really hard.
Solution 2 – Using SSIS Fuzzy Matching
The Dynamics 365 Data Export service is a free-add which exports the Dynamics 365 CRM data to any customer owned Azure SQL DB. This is a one of the best add-on’s Microsoft has come up, which can be used in building up a Data warehouse to perform all kinds of reporting and data validations. So, once the Dynamics 365 CRM data is synchronized to an Azure SQL DB, we can create SSIS packages to find duplicates. SQL Server SSIS supports “Fuzzy Grouping” which is one of the best options to filter duplicate data. Check the Microsoft Docs link for detailed configuration steps. The only catch is “Fuzzy Grouping” needs access to SQL temp DB. So we cannot use Azure SQL DB in cloud, we need access to a local SQL Server or SQL Server on a cloud VM . The high level design is
Solution 3 – Using SQL SOUNDEX
“SOUNDEX” a predefined function part of Microsoft SQL, returns a four character code to evaluate the similarity of two strings. For example consider the below contacts which have a similar sound. They seem to be duplicates with typo’s. The SOUNDEX(fullname) of full name results with a same sound code. Using this feature we can definitely show duplicates within Dynamics.
In the example above, you can see three duplicate contacts with typo’s. The SOUNDEX results show the same code being generated. The design I have used is to use the soundcode feature to identify duplicates.
I have created a new attribute within Contacts as “SoundCode” which will be used to capture the psuedo code from SOUNDEX. Once the data is synchronized using Data Export Service to a Azure SQL DB, I have created a insert trigger which updates the record with a sound code. This information is captured within the newly created attribute.
The idea is to show these probable duplicates back in Dynamics 365 within the Contacts form. So, for that I have created a simple Azure function which retrieves the probable duplicates by querying the Contact table from the Azure SQL DB. I have created a simple HTML web resource which sends in a request to execute the Azure function and show the results within a table. This web resource is called on click of a button with in the Contact menu.
Considering the data volume of the customer, this solution proved out to give better results than the other solutions.