Name Matching Techniques: Useful Algorithms, Their Problems, & Absolute Solutions
A concise guide to the name and text matching algorithms available, and the right way to decide on the best algorithm for your use case.
What is a name-matching problem?
Data is the new Oil, Analytics is the Refinery and Intelligence is the Gasoline which drives the Growth.
Tiffani Bova
A simple yet powerful quotation. If you look closely at it, you will find nearly every organization in the world, independent of its size or location, working to achieve this goal. If you look more in-depth, you will find it is made up of three entities: DATA, ANALYTICS, and INTELLIGENCE.
There is no growth without intelligence, no intelligence without analytics, and no analytics without data.
Now that we know data is a basic need for any organization, and that the Internet is the most common source of data, let’s dive into the name-matching problem.
The Internet is overloaded with data, and there are numerous ways an organization can get data from multiple entities. Data is a resource for an organization, but data without proper identification (consistency and integrity) will soon turn into a liability if it is not nurtured properly.
Here, nurturing means combining this data with the right demographics (person names, ages, contacts, etc.). If an organization’s data lacks consistency, it may show wrong metrics, which can lead the organization to bad decisions.
Name matching is important for identifying a person’s behavior, whether daily, weekly, or occasional.
For example, suppose a person looks for shoes between 8 and 9 PM every 150 days through social media advertisements, except that he also goes shoe shopping 10–15 days before Christmas using multiple search engines and e-commerce websites. This information can be used to target the right set of customers at the right time.
If an organization fails to identify the same person across multiple browsers, social media platforms, or e-commerce platforms, it might lose out on sales and waste capital on digital marketing.
A lot of organizations are involved in mining demographic information from across applications like e-mail, customer or patient records, news articles, and business or political memorandums that they might have received.
Sources of name or text variation
By now we have an understanding of:
- Why data is important for an organization
- What a name-matching problem is and why it is important for an organization
Hence, we will move to the next phase to understand the origin of this problem. Is there a way to identify the source, so we can choose our algorithm on the basis of the source?
For example, if data comes from a third-party application form that limits names to 20 letters, then all names that are exactly 20 letters long can be checked for abbreviations or treated as truncated data and handled accordingly, without running other, costlier algorithms. A few such techniques can save a significant amount of time and money.
Here are a few examples that will help you identify the cases for your organization and apply the right algorithm to the right problem.
- Handwritten OCR: characters such as q and g, or m and rn, can be misread as similar.
- Manual keyboard data entry or entry over the phone: neighbouring keyboard keys such as n and m, or e and r, are easily swapped.
- Limitation of the maximum allowed length: the input field forces people to use abbreviations or initials only.
- Deliberately providing a modified name: due to a lack of faith in the organization, or to keep personal data out of public records.
- Different demographics: the data-entry person does not know the right format for other demographics; for example, someone sitting in the US can put the wrong spelling for Indian or Spanish names.
These are basic examples of how an entity (an end-user or a data-entry professional) can create multiple variations of names within the same organization, or across data retrieved from multiple organizations.
Types and examples of different referential/demographic data
Let’s take a few more examples where spelling mistakes or phonetic name variation could occur.
One of the most important parts of making analytics score better is to standardize referential data, which includes names and addresses. If we look only at the above-mentioned problems, we can divide the matching problem into two categories.
Name Matching
Personal Names Matching:
While there is only one correct spelling for many words, there are often several valid spelling variations for personal names. Examples:
- Nicknames, like ‘Bill’ rather than ‘William’
- Phonetically similar names, like ‘Gail’, ‘Gale’ and ‘Gayle’
- Personal names that change over time, e.g. after a marriage or a religious conversion
Company Names Matching:
- Abbreviations like LTD, Ltd, and Limited; Corp and Corporation; or IBM and International Business Machines, which can be used interchangeably across documents
- Typographical errors like Oracle and Orcle
- Omissions like Goyal & Sons and Goyals
Product Name Matching:
- “Canon PowerShot a20IS”, “NEW powershot A20 IS from Canon” and “Digital Camera Canon PS A20IS” should all match “Canon PowerShot A20 IS”
Text Matching:
Address Matching:
- 134 Ashewood Walk, Summerhill Lane, Portlaoise
- 134 Summerhill Ln, Ashewood Walk, Summerhill, Portlaoise, Co. Laois, R32 C52X, Ireland
Product review and NLP fuzzy matching: Product reviews can be confusing due to very similar product names, or review text that does not give a very clear picture. Example:
- John Snow: What’s so fuzzy about this?
- Danny: I think it was a good fuzz.
Similarly, there could be multiple other examples where name or text matching can make a huge difference, and collating all of that information is crucial.
Why Name matching is so hard
By now we have an understanding of:
- Why data is important for an organization
- What a name-matching problem is and why it is important for an organization
- The sources of, and examples of, multiple name variations
Now, let’s understand why name matching is so hard, and why most companies are still not able to figure out a solution with a 100% match score.
Names are heavily influenced by people’s cultural backgrounds, and first names, middle names, and last names can be represented in different ways.
Names are often recorded with different spellings, and applying exact matching leads to poor results. The most common cases that I can come up with are:
- Spelling errors (80% of cases)
- Spelling variations: ‘O’Connor’, ‘OConnor’ and ‘O Connor’
- Phonetic similarity: ‘Gail’ and ‘Gayle’
- Name changes: ‘Tina Smith’ might be recorded as ‘Christine J. Smith’ and as ‘C.J. Smith-Miller’
- Short forms: ‘Bob’ for Robert, or ‘Liz’ for Elizabeth
- Compound names, favoured in some European countries: ‘Hans-Peter’ or ‘Jean-Pierre’
- Hispanic and Arabic names, which can contain more than two surnames
- Record linkage, where a record contains more than just names
These examples show why the name-matching problem is so hard, though there are multiple ways to resolve the issues. Let’s discuss those techniques in detail.
Name Matching Techniques
By now, we understand the sources of different names and their complexities. To improve matching accuracy, many different techniques for approximate name matching have been developed over the past four decades, and new techniques are still being invented.
Most of the solutions are based on pattern matching, phonetic encoding, or a combination of these two approaches.
In this document, we will discuss each approach at a very high level; I will add more blogs detailing each technique.
Please note: the descriptions of the algorithms have been deliberately shortened to keep this blog a reasonable length; there are separate blogs for each algorithm.
Phonetic Encoding:
1. Soundex: It keeps the first letter of a string and converts the rest into numbers using a mapping/encoding table: b, f, p, v → 1; c, g, j, k, q, s, x, z → 2; d, t → 3; l → 4; m, n → 5; r → 6; the remaining letters (vowels plus h, w and y) are dropped. The code is then truncated or zero-padded to one letter and three digits. For example, both ‘Robert’ and ‘Rupert’ are encoded as R163; see the sketches after this list.
2. Phonex: It is a variation of Soundex that tries to improve the encoding quality by pre-processing names according to their English pronunciation before the encoding.
3. Phonix: This encoding goes a step further than Phonex and applies more than one hundred transformation rules to groups of letters before encoding with its own encoding table. This algorithm is comparatively slow due to the large number of transformations.
4. NYSIIS (New York State Identification and Intelligence System): It is based on transformation rules similar to Phonex and Phonix, but it returns a code made of letters only.
5. Double Metaphone: It is specialized for European and Asian names. For example, ‘kuczewski’ will be encoded as ‘ssk’ and ‘xfsk’, accounting for different spelling variations.
6. Fuzzy Soundex: It is based on q-gram substitution. When the Fuzzy Soundex technique is combined with a q-gram based pattern-matching algorithm, its accuracy results are better than Soundex.
Different algorithms produce different encodings for the same word, as the sketches below show.
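To make the Soundex mapping concrete, here is a minimal, self-contained Python sketch. It reconstructs the standard encoding table described above; note that full Soundex has a special rule for ‘h’ and ‘w’ that is simplified here.

```python
# Standard Soundex mapping table: letters -> digit codes.
SOUNDEX_MAP = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(name: str) -> str:
    if not name:
        return ""
    name = name.lower()
    codes = [SOUNDEX_MAP.get(ch, "") for ch in name]  # vowels/h/w/y -> ""
    result, prev = [], codes[0]
    for code in codes[1:]:
        if code and code != prev:  # collapse adjacent duplicate codes
            result.append(code)
        prev = code
    # Keep the first letter, then pad/truncate the digits to length 3.
    return (name[0].upper() + "".join(result) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- they match
print(soundex("Gail"), soundex("Gayle"))     # G400 G400 -- phonetic variants match
```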
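And to compare how several of the algorithms above encode the same word, here is a quick sketch assuming a recent version of the open-source jellyfish library is installed (it ships Soundex, Metaphone, NYSIIS, and the match rating codex; Phonex and Phonix are not included).

```python
import jellyfish

# Print the encoding each algorithm produces for the same input word.
for name in ["catherine", "katherine", "gail", "gayle"]:
    print(
        name.ljust(10),
        jellyfish.soundex(name),
        jellyfish.metaphone(name),
        jellyfish.nysiis(name),
        jellyfish.match_rating_codex(name),
    )
```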
Pattern Matching:
- Levenshtein or Edit Distance: The smallest number of edit operations (insertions, deletions, and substitutions) required to change one string into another.
- Damerau-Levenshtein Distance: A variant of the Levenshtein distance with the addition of a provision for the transposition of two adjacent characters; the distance is the minimum number of operations (insertions, deletions, substitutions, and transpositions of two adjacent characters) required to transform one string into the other. A comparison sketch of the edit-distance and Jaro families follows this list.
- Bag Distance: The bag distance algorithm compares the multisets (bags) of characters in two strings, and calculates the distance as the larger number of characters present in one bag but not the other. It is a simple and fast algorithm, often used as a cheap lower bound for edit distance, but it does not take the order of the characters into account.
- Smith-Waterman: A dynamic programming algorithm used for local sequence alignment. It aligns two sequences in a way that maximizes the number of matching characters while allowing the insertion of gaps to optimize the alignment. It was originally developed to find optimal alignments between biological sequences, like DNA or proteins.
- Longest common sub-string (LCS): The longest common substring (LCS) is a string that is common to two or more strings and is the longest string that is a substring of all the strings. It is used to find the similarities between two or more strings and is often used in text comparison, data mining, and natural language processing. There are various algorithms for finding the LCS, including dynamic programming and suffix trees.
- Q-grams: The Q-gram algorithm is a method for comparing strings by breaking them down into fixed-length substrings, or “grams”, and comparing the sets of grams of the two strings. It is a fast and simple algorithm, but it can be less accurate than other methods, as it does not take into account the order of the grams or the distances between them; a minimal sketch follows this list. Q-grams are often used in spelling correction, text search, and information retrieval.
- Positional Q-grams: Positional Q-grams is a variant of the Q-gram algorithm that takes into account the position of the grams within the string. It is used for comparing strings by breaking them down into fixed-length substrings and comparing the set of grams for each string, while also considering the positions of the grams in the original string. Positional Q-grams can be more accurate than regular Q-grams, as it takes into account the order of the grams within the string. It is often used in information retrieval and natural language processing.
- Skip-grams: Skip-grams were developed in experiments on multi-lingual texts from different European languages, and showed improved results compared to bigrams, trigrams, edit distance, and the longest common sub-string technique.
- Compression: Compression matching is a method for comparing strings by comparing their compressed representations. The idea is that two similar strings will compress to a smaller size together than two dissimilar strings. To perform compression matching, the strings are first compressed using a lossless compression algorithm, such as gzip or bzip2, and the sizes of the compressed representations are then combined into a distance score. Compression matching can be an effective way to compare strings, especially for long strings or large datasets, but it can be computationally expensive, as every string has to be compressed; a sketch using the normalized compression distance follows this list.
- Jaro: The Jaro algorithm is commonly used for name matching in data linkage systems. It accounts for insertions, deletions, and transpositions. The algorithm calculates the number c of common characters (agreeing on characters that are within half the length of the longer string) and the number of transpositions.
- Winkler: The Winkler algorithm improves upon the Jaro algorithm using the empirical observation that fewer errors typically occur at the beginning of names. It therefore increases the Jaro similarity measure for strings agreeing on their initial characters (up to four).
- Sorted-Winkler: If a string contains more than one word (i.e. it contains at least one whitespace or other separator), the words are first sorted alphabetically before the Winkler technique is applied to the full strings. The idea is that, unless there are errors in the first few letters of a word, sorting will bring swapped words into the same order, thereby improving the matching quality.
- Permuted-Winkler: In this more complex approach, Winkler comparisons are performed over all possible permutations of words, and the maximum of all calculated similarity values is returned. A sketch of both variants follows this list.
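Here is the promised comparison sketch for the edit-distance and Jaro families, again assuming a recent version of the jellyfish library; the example pairs are my own.

```python
import jellyfish

# Example name pairs picked for illustration.
pairs = [("gail", "gayle"), ("oracle", "orcle"), ("jon", "john")]

for a, b in pairs:
    print(f"{a!r} vs {b!r}")
    # Levenshtein counts insertions, deletions, and substitutions.
    print("  levenshtein        :", jellyfish.levenshtein_distance(a, b))
    # Damerau-Levenshtein also counts an adjacent transposition as one edit.
    print("  damerau-levenshtein:", jellyfish.damerau_levenshtein_distance(a, b))
    # Jaro and Jaro-Winkler return similarities in [0, 1]; Winkler boosts
    # pairs that already agree on the first few characters.
    print("  jaro               :", round(jellyfish.jaro_similarity(a, b), 3))
    print("  jaro-winkler       :", round(jellyfish.jaro_winkler_similarity(a, b), 3))
```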
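Next, the q-gram sketch: a self-contained bigram comparison scored with the Jaccard coefficient. The ‘#’ padding, so that the first and last characters form their own grams, is a common convention rather than a fixed part of the algorithm.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Split a string into overlapping substrings of length q."""
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)  # pad both ends
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)  # Jaccard: shared grams / all grams

print(round(qgram_similarity("catherine", "katherine"), 3))  # 0.667 -- high overlap
print(round(qgram_similarity("catherine", "jones"), 3))      # 0.067 -- low overlap
```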
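For compression matching, a common formulation is the normalized compression distance (NCD); the sketch below uses Python's built-in zlib. Lower values mean more similar strings, and results are noisy for very short names because of the compressor's fixed overhead.

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data))

def ncd(a: str, b: str) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    x, y = a.encode(), b.encode()
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar strings compress well together, so their NCD is lower.
print(round(ncd("international business machines corp",
                "international business machine corporation"), 3))
print(round(ncd("international business machines corp",
                "oracle corporation limited"), 3))
```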
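Finally, a sketch of the Sorted-Winkler and Permuted-Winkler variants, built on jellyfish's Jaro-Winkler; the helper functions here are my own, not a library API.

```python
from itertools import permutations

import jellyfish

def sorted_winkler(a: str, b: str) -> float:
    # Sort the words alphabetically so swapped words line up.
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return jellyfish.jaro_winkler_similarity(sa, sb)

def permuted_winkler(a: str, b: str) -> float:
    # Try every word order of the second name and keep the best score.
    # The number of permutations grows factorially, so this is only
    # practical for names with a handful of words.
    return max(
        jellyfish.jaro_winkler_similarity(a.lower(), " ".join(p))
        for p in permutations(b.lower().split())
    )

print(sorted_winkler("jean pierre martin", "martin jean pierre"))    # 1.0
print(permuted_winkler("jean pierre martin", "martin jean pierre"))  # 1.0
```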
Combined Techniques:
- Editex: Aims at improving phonetic matching accuracy by combining edit distance-based methods with the letter-grouping techniques of Soundex and Phonix. The edit costs in Editex are 0 if two letters are the same, 1 if they are in the same letter group, and 2 otherwise. Comparison experiments in [34] showed that Editex performed better than edit distance, q-grams, Phonix, and Soundex on a large database containing around 30,000 surnames. As with basic edit distance, the time and space complexities are quadratic in the string lengths; a simplified sketch follows this list.
- Syllable alignment distance: This recently developed technique, called Syllable Alignment Pattern Searching (SAPS) [13], is based on the idea of matching two names syllable by syllable, rather than character by character. It uses the Phonix transformation (without the final numerical encoding phase) as a preprocessing step, and then applies a set of rules to find the beginnings of syllables. An edit distance based approach is then used to find the distance between the two strings.
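A simplified Editex sketch, using the letter groups described in the literature; note that full Editex also assigns special, cheaper deletion costs to ‘h’ and ‘w’, which this version omits.

```python
# Letter groups used by Editex; a letter may appear in several groups.
EDITEX_GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]

def letter_cost(a: str, b: str) -> int:
    if a == b:
        return 0
    if any(a in g and b in g for g in EDITEX_GROUPS):
        return 1  # same letter group, e.g. 'i' and 'y'
    return 2

def editex(s: str, t: str) -> int:
    # Ordinary edit-distance dynamic programming with group-aware costs.
    s, t = s.lower(), t.lower()
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        dp[i][0] = dp[i - 1][0] + 2
    for j in range(1, len(t) + 1):
        dp[0][j] = dp[0][j - 1] + 2
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 2,  # delete s[i-1]
                dp[i][j - 1] + 2,  # insert t[j-1]
                dp[i - 1][j - 1] + letter_cost(s[i - 1], t[j - 1]),
            )
    return dp[-1][-1]

print(editex("gail", "gayle"))  # 3: i->y costs 1, inserting 'e' costs 2
print(editex("gail", "mail"))   # 2: g and m are in different groups
```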
Problems with current techniques
We have discussed the characteristics of personal names and the potential sources of variations and errors in them, and we presented an overview of both pattern-matching and phonetical encoding-based name matching techniques. Experimental results on different real data sets have shown that there is no single best technique available. The characteristics of the name data to be matched, as well as computational requirements, have to be considered when selecting a name-matching technique.
How Python or Databases can be utilized
Python also has a lot of phenomenal libraries created by very good developers. A few libraries worth looking at (a quick example follows this list):
- thefuzz
- hmni
- fuzzywuzzy
- namematcher
- dedupe
- scikit-learn's TF-IDF vectorizer
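As a quick, hedged example of the first of these, here is thefuzz (the maintained successor of fuzzywuzzy), assuming it is installed via pip install thefuzz; the product strings reuse the Canon example from earlier.

```python
from thefuzz import fuzz, process

# Plain ratio is sensitive to word order; token_sort/token_set are not.
print(fuzz.ratio("Canon PowerShot A20 IS", "canon powershot a20is"))
print(fuzz.token_sort_ratio("Canon PowerShot A20 IS",
                            "NEW powershot A20 IS from Canon"))
print(fuzz.token_set_ratio("Canon PowerShot A20 IS",
                           "Digital Camera Canon PS A20IS"))

# process.extractOne picks the best match from a list of candidates.
choices = ["Canon PowerShot A20 IS", "Nikon Coolpix", "Sony Cybershot"]
print(process.extractOne("new powershot a20is from canon", choices))
```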
Not only Python: other languages also have similar libraries that provide this functionality out of the box and can save a few lines of complex code.
Some databases also provide in-built matching functionality, which can definitely save some network bandwidth in your production system (a hedged sketch follows this list):
- Jaro-Winkler by Snowflake
- Jaccard Index by Snowflake
- Edit Distance by Snowflake
- Levenshtein by Postgres
- Trigrams by Postgres
- AWS + Azure in-built Machine Learning
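As a sketch of pushing the matching into the database, here is Postgres's levenshtein() (from the fuzzystrmatch extension) and trigram similarity() (from pg_trgm), called through psycopg2; the connection string is a placeholder for your own setup.

```python
import psycopg2

# Placeholder DSN -- point this at your own database. The fuzzystrmatch
# and pg_trgm extensions must be enabled first (CREATE EXTENSION ...).
conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")

with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT levenshtein(%s, %s), similarity(%s, %s)",
        ("GAIL", "GAYLE", "GAIL", "GAYLE"),
    )
    distance, trigram_sim = cur.fetchone()
    print("levenshtein distance:", distance)       # an integer edit count
    print("trigram similarity  :", trigram_sim)    # a score in [0, 1]
```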
Conclusion
Personal name matching is very challenging, and more research into the characteristics of both name data and matching techniques has to be conducted in order to better understand why certain techniques perform better than others, and which techniques are most suitable for what type of data. A more detailed analysis of the types and distributions of errors is needed to better understand how certain types of errors influence the performance of matching techniques.
I hope you have learned something new today; please stay tuned for more such blogs. In the near future, I plan to add more detailed documentation for each of the algorithms, where we will not only discuss each algorithm in detail with examples but also discuss their implementation and some real-world production use cases that I have been involved in.
References and Courtesy:
- The Australian National University, Technical Report TR-CS-06-02, by Peter Christen
- https://undraw.co/
Originally published at https://singlequote.blog on January 2, 2023.