Rule-based deduplication has long been the go-to approach for deduping Salesforce environments, but is this the smartest way to go about it? Is there a less time-consuming way to cleanse your data without the need to standardize or otherwise prepare your data?
Yes. A machine learning-based approach to deduplication not only saves time by eliminating the need to create rules or filters, but also improves accuracy and makes contacts easier to manage. You could say that machine learning offers the smartest approach to deduplication, and in this article, we’ll tell you why.
How Does Rule-Based Deduplication Work?
Whenever one of your sales professionals, marketing team members, or other Salesforce users encounters a duplicate record, they notify the Salesforce admin, who proceeds to create a rule to prevent this from occurring in the future. This process repeats every time a new duplicate is discovered.
Not only is such rule creation time-consuming, but it is also a losing battle, since there is no way to account for every possible “fuzzy” duplicate. Also, as more rules are created over time, they can start conflicting with each other. If the rules are not set up correctly, they can even block legitimate leads, such as web-to-lead submissions, from coming in.
Salesforce’s built-in deduplication functionality is also rule-based, but it is very limited. For example, you will not be able to dedupe custom objects or perform cross-object deduplication, and you can only merge three records at a time, among many other limitations. A lot of the deduplication apps on the AppExchange have significantly expanded these functionalities, but they kept the same rule-based approach. This means that the user still has to deal with all of the problems associated with rule creation. Now, let’s shift our attention to how the rule-based approach differs from the machine learning-based approach.
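To make the problem concrete, here is a minimal sketch of what a rule-based matcher boils down to. The rule functions and record layout are hypothetical illustrations, not Salesforce's actual matching rules; the two records reuse the Al Pacino example discussed later in this article.

```python
def normalize(value: str) -> str:
    """Trim whitespace and lowercase, the kind of standardization rules rely on."""
    return value.strip().lower()

def rule_same_email(a: dict, b: dict) -> bool:
    # Hypothetical rule 1: emails match exactly after normalization.
    return normalize(a["email"]) == normalize(b["email"])

def rule_same_name_and_address(a: dict, b: dict) -> bool:
    # Hypothetical rule 2: last name and address both match exactly.
    return (normalize(a["last_name"]) == normalize(b["last_name"])
            and normalize(a["address"]) == normalize(b["address"]))

RULES = [rule_same_email, rule_same_name_and_address]

def is_duplicate(a: dict, b: dict) -> bool:
    """A pair is flagged only if at least one hand-written rule fires."""
    return any(rule(a, b) for rule in RULES)

rec_a = {"first_name": "Al", "last_name": "Pacino",
         "email": "firstname.lastname@example.org",
         "address": "4000 Warner Blvd., Burbank, CA 91522"}
rec_b = {"first_name": "Alfredo", "last_name": "pacino",
         "email": "email@example.com",
         "address": "4000 Warner Boulevard, Burbank, CA 91522"}

# Neither rule fires: the emails differ and "Blvd." is not "Boulevard".
print(is_duplicate(rec_a, rec_b))  # False
```

Catching this pair would take yet another rule (for example, one that expands street abbreviations), which is exactly the cycle described above.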
How Does Machine Learning Work for Deduplication of Contacts?
If we were to look at the two records below, it would be pretty obvious that they are duplicates:
| First Name | Last Name | Email | Address |
| Al | Pacino | firstname.lastname@example.org | 4000 Warner Blvd., Burbank, CA 91522 |
| Alfredo | pacino | email@example.com | 4000 Warner Boulevard, Burbank, CA 91522 |
Even though we can be certain that these are duplicates, could we explain why? It’s actually harder than you might think, but it is exactly what needs to be done if we are to create a machine that can do this job for us. Perhaps a good place to start would be the similarities, but then you run into another problem: defining what you mean by “similar”. Are there gradations of “similar”? If so, what are they? Which similarities automatically indicate that two records are duplicates?
One of the ways researchers train machine learning algorithms to identify similar records is by using string metrics. A string metric takes two strings and returns a number that is low if the strings are similar and high if they are not. There are many different string metrics, and going into detail about each one is beyond the scope of this article, but let’s take a look at a couple of them.
One of the most commonly used string metrics is the Hamming distance, which compares two strings of equal length and counts the positions at which they differ, that is, the number of substitutions that must be made to turn one string into the other. For example, returning to the records above, only one substitution is needed to turn “pacino” into “Pacino,” so the Hamming distance is 1.
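The Hamming distance is simple enough to sketch in a few lines. The function name below is our own; the logic is the standard definition, which only applies when both strings have the same length.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("pacino", "Pacino"))  # 1
print(hamming_distance("pacino", "pacino"))  # 0
```

Note the limitation: “Blvd.” and “Boulevard” have different lengths, so this metric cannot compare them at all, which is one reason other string metrics exist.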
There are also learnable distance metrics, which take into account that different edit operations carry different significance in different domains. For example, if we change one digit in the street number or the zip code, we are basically changing the entire address. However, a change in the street name may not be as significant, since it could be a typo or an abbreviation. In this way, the machine learning system approximates the judgment a human would make, which makes it much smarter than simply creating rules that filter out duplicates. This brings us to the next section, where we will look at the benefits of the machine learning approach.
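The idea of edit operations with different costs can be sketched as a weighted edit distance. This is only an illustration: the cost values below are hand-picked assumptions, whereas a learnable metric would estimate them from labeled examples, and the function names are our own.

```python
def indel_cost(ch: str) -> float:
    # Assumed cost: inserting or deleting a digit is 3x as costly as a letter.
    return 3.0 if ch.isdigit() else 1.0

def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.1  # assumed: a case-only change barely matters
    if a.isdigit() or b.isdigit():
        return 3.0  # assumed: changing a digit changes the address
    return 1.0

def weighted_edit_distance(s: str, t: str) -> float:
    """Edit distance where each operation's cost depends on the characters."""
    # prev[j] = cost of turning the empty prefix of s into the first j chars of t
    prev = [0.0]
    for ct in t:
        prev.append(prev[-1] + indel_cost(ct))
    for cs in s:
        curr = [prev[0] + indel_cost(cs)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost(cs),                 # delete cs
                curr[j - 1] + indel_cost(ct),             # insert ct
                prev[j - 1] + substitution_cost(cs, ct),  # substitute cs -> ct
            ))
        prev = curr
    return prev[-1]

print(weighted_edit_distance("91522", "91523"))    # 3.0
print(weighted_edit_distance("Warner", "Warnar"))  # 1.0
```

With these assumed costs, a one-digit change in the zip code counts three times as much as a one-letter slip in the street name, mirroring the intuition described above.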
What are the Benefits of Machine Learning-Based Deduplication?
Artificial Intelligence (AI) lets machines approximate human judgment and, in our situation, lets the system learn from your data which records are duplicates and which ones are unique. This process is called Active Learning. Basically, as you label each pair of records as a duplicate (or not), the system automatically learns which record fields are the most important and assigns them more weight than the others. For example, if the “Email” field is more important than the “First Name” field, it will be able to calculate exactly how much more important it is, something that would be impractical for a human to do by hand. It will then apply those same field weights to subsequent records as you add them to Salesforce.
From this, we see that the system does most of the work for you, which is one of the main advantages AI has to offer. There are no complex rules to set, there is no need to standardize your data, and this approach is more scalable than rule-based deduplication.
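Here is a rough sketch of how field weights can be learned from labels. It uses plain logistic regression as a stand-in and covers only the weight-learning step, not the active-learning loop that decides which pairs to ask you about. It is not DataGroomr's implementation, and the labeled pairs and similarity scores are made up for illustration.

```python
import math

FIELDS = ("email", "first_name")

# Hypothetical labeled pairs: (email similarity, first-name similarity, label),
# where similarities run from 0 to 1 and label 1 means "duplicate".
LABELED_PAIRS = [
    (1.0, 1.0, 1), (1.0, 0.4, 1), (0.9, 0.2, 1), (0.95, 0.8, 1),
    (0.1, 1.0, 0), (0.2, 0.9, 0), (0.0, 0.1, 0), (0.1, 0.3, 0),
]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, epochs: int = 2000, learning_rate: float = 0.5):
    """Fit one weight per field (plus a bias) by batch gradient descent."""
    weights = [0.0] * len(FIELDS)
    bias = 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = [0.0] * len(FIELDS)
        grad_b = 0.0
        for *features, label in pairs:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            for k, x in enumerate(features):
                grad_w[k] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return dict(zip(FIELDS, weights)), bias

def duplicate_probability(weights: dict, bias: float, similarities: dict) -> float:
    """Score a new pair of records using the learned field weights."""
    z = bias + sum(weights[f] * similarities[f] for f in FIELDS)
    return sigmoid(z)

weights, bias = train(LABELED_PAIRS)
print(weights)  # the "email" weight comes out larger than "first_name"
print(duplicate_probability(weights, bias, {"email": 0.95, "first_name": 0.3}))
```

In this toy data set, email similarity separates duplicates from non-duplicates far better than first-name similarity, so the model learns a larger weight for it without anyone writing a rule that says so.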
Trust Machine Learning to Dedupe Your Salesforce Data
If you are not satisfied with your current rule-based deduping tool or you are spending too much time creating rules, consider switching over to a platform that applies machine learning. As we discussed above, it is a smarter approach and it will do a much more comprehensive deduping job. DataGroomr is the only machine learning-based Salesforce deduplication app on the AppExchange. It will help you to realize all of the benefits of machine learning and do a more thorough data cleansing job.