Alternate ways to find duplicates in Dynamics 365 CRM

In my previous blog post Finding duplicates using PowerBI I have explained how PowerBI can be used to find duplicates. It kept me thinking what are other effective feasible ways to find duplicates other than PowerBI. One of my customer was not using PowerBI and they had a huge dataset of contacts approximately 20 million, which needs to be cleansed. So, I started exploring

Solution 1 – Using Excel PowerQuery and Fuzzy matching

Using Excel’s fuzzy matching add-in we can easily find duplicates. As Microsoft has made the CDS connection available using Power Query (check this link), it has become a breeze to work on Dynamics 365 data.  Excel is great with smaller datasets but with the above described scenario it may not be a best fit, and more over there will be lot of manual effort involved. Even automation is really hard.

Solution 2 – Using SSIS Fuzzy Matching

The Dynamics 365 Data Export service is a free-add which exports the Dynamics 365 CRM data to any customer owned Azure SQL DB. This is a one of the best add-on’s Microsoft has come up, which can be used in building up a Data warehouse to perform all kinds of reporting and data validations. So, once the Dynamics 365 CRM data is synchronized to an Azure SQL DB, we can create SSIS packages to find duplicates. SQL Server SSIS supports “Fuzzy Grouping” which is one of the best options to filter duplicate data. Check the Microsoft Docs link for detailed configuration steps. The only catch is “Fuzzy Grouping” needs  access to SQL temp DB. So we cannot use Azure SQL DB in cloud, we need access to a local SQL Server or SQL Server on a cloud VM . The high level design is

image

Solution 3 – Using SQL SOUNDEX

SOUNDEX” a predefined function part of Microsoft SQL, returns a four character  code to evaluate the similarity of two strings. For example consider the below contacts which have a similar sound. They seem to be duplicates with typo’s. The SOUNDEX(fullname) of full name results with a same sound code. Using this feature we can definitely show duplicates within Dynamics.

image

In the example above, you can see three duplicate contacts with typo’s. The SOUNDEX results show the same code being generated. The design I have used is to use the soundcode feature to identify duplicates.

image

I have created a new attribute within Contacts as “SoundCode” which will be used to capture the psuedo code from SOUNDEX. Once the data is synchronized using Data Export Service to a Azure SQL DB, I have created a insert trigger which updates the record with a sound code. This information is captured within the newly created attribute.

image

The idea is to show these probable duplicates back in Dynamics 365 within the Contacts form. So, for that I have created a simple Azure function which retrieves the probable duplicates by querying the Contact table from the Azure SQL DB.  I have created a simple HTML web resource which sends in a request to execute the Azure function and show the results within a table. This web resource is called on click of a button with in the Contact menu.

Considering the data volume of the customer, this solution proved out to give better results than the other solutions.

Find Duplicate Data in Dynamics using PowerBI

One of the critical tasks of today’s organizations is data quality and handling. The key to data quality is making sure that data is unique and filtered so that it will be easy to integrate with different systems with in the organization. To avoid duplicate within the data is one of he major tasks every organization is trying to hardwire within their critical systems. In this series of blog posts, I going to discuss few scenarios of handling duplicate data in Microsoft Dynamics 365.

We all know the traditional OOTB De-Duplication feature of Microsoft Dynamics 365 CRM, which is majorly used to detect duplicates. In few scenarios we even use “Unique Keys” to restrict duplicates. But these features either use “Exact” or ”Partial” match precisions. But this feature cannot handle typo’s or misspellings. Recently Microsoft has released the Fuzzy Matching feature as part of Power BI. For more details read FuzzyMatching PowerBI. Lets look at the steps to create a PowerBI report which shows up probable duplicates with in Dynamics 365 CRM system using Fuzzy Matching.

Open Power BI Desktop, create a new BI report. We are going to use the new CDS connector to connect to the Dynamics 365 CRM system. Select the Data Source as CDS connector.

image

Set the URL to the Dynamics 365 CRM instance.

image

Select the entity which requires to be filtered. Lets assume that we have duplicate data as part of Contact entity. As you can see the FirstName is misspelled.

image

Select the Contact Entity as shown below.

image

For our scenario, as the duplicate data is part of the contact names, lets select the fullname field along with contactid for filtering.

image

Once selected, the data will be shown in the view. The next step is to apply fuzzy matching on the data to filter out the duplicate data. Click the “Merge Queries” or “Merge Queries as New option”.image

From the Merge Wizard select the parent and the reference entity as Contact. We are going to perform an left outer join on the results to find the duplicates. Select the fullname column to create the merged table.

image

As we are going to use the Fuzzy match, check the box. Set the threshold to a value between 0 to 1.0. This is going be the driving factor to identify probable matches. I set at .90, so the understanding is to consider 90% probable matches. Rest are optional and if required can be checked.

image

Once the filter is applied the output clearly identifies the “Vincent Lauriant” as a probable duplicate contact. As part of this example I am going to filter out original records and show only the duplicate data. We can perform various operations at this stage to get the desired output. The simplest option I went for is to create index column to filter this data.

image

I have added a conditional column with a simple validation to filter out the non duplicates. I have used the “contactid” for that.

imageimage

Apply the filter to only show the duplicate data and our result set is ready. Publish the report to the Power BI online and set a schedule to refresh the dataset on a regular basis.

imageimage

This report can be published as a Dashboard report with in your Dynamics 365 CRM instance and users can constantly can validate the data. 

image