Clean Matching with FuzzyWuzzy

String Matching in Python Using the Levenshtein Distance

Close, but no cigar

Python comes equipped with numerous ways to handle inconsistencies in strings, and Natural Language Processing (NLP) tools go further by vectorizing and transforming words into usable formats for analysis. With methods such as .upper(), .title(), .strip(), and .replace() available to format text, matching two similar strings becomes quite simple. However, Python has proved to me time and time again that there is almost always a simpler method to the madness (or to receiving error messages).

Here is where I discovered the FuzzyWuzzy package. When attempting to merge two DataFrames whose names were mostly related but not identical, I wound up exhausting different types of joins and concatenations, as well as hours of mental clarity, just trying not to lose the majority of the rows that could not be matched directly. Merging with Pandas alone is temperamental because the keys must match exactly, whereas FuzzyWuzzy lets you set a threshold: the minimum similarity score two strings must reach to count as a match.

FuzzyWuzzy builds on the Levenshtein distance. Named after mathematician Vladimir Levenshtein, this distance is the minimum number of single-character edits (insertions, deletions, and substitutions) required to change one string into another. It is computed by comparing the strings index by index and accumulating the cost of each edit needed to move from one position to the next.

To understand the implementation of the Levenshtein Distance Algorithm:
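A minimal sketch of such a function, assuming the standard dynamic-programming formulation with a cost of 1 per insertion, deletion, or substitution. The name levenshtein_distance and the layout of the matrix are illustrative choices, not necessarily those of the original code:

```python
def levenshtein_distance(source, target):
    """Return the number of single-character edits that turn source into target."""
    rows, cols = len(source) + 1, len(target) + 1

    # distance[i][j] holds the cost of turning source[:i] into target[:j].
    distance = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        distance[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        distance[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            # Diagonal step: free if the characters match, otherwise 1.
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,                 # deletion
                distance[i][j - 1] + 1,                 # insertion
                distance[i - 1][j - 1] + substitution,  # match or substitution
            )

    # print(distance)  # uncomment to inspect the full cost matrix
    return distance[rows - 1][cols - 1]


print(levenshtein_distance("3 Bears OG", "Three Bears og"))  # 7
```

The comparison is case-sensitive, so ‘OG’ and ‘og’ account for two of the seven edits.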

Using this function, we can compute the Levenshtein distance between two strings, that is, the number of edits required to make them match exactly.

Uncommenting the print(distance) portion of the above function prints a matrix that visually explains the computational cost of matching the strings. The number in the bottom-right corner is the number of edits required to match the strings. A step to a horizontally or vertically adjacent cell represents an insertion or a deletion, respectively, each at a cost of 1. A diagonal step costs 1 if the two characters in the row and column do not match, and 0 if they do. Each cell takes the minimum of these options, so it always holds the lowest local cost. Below, we have the matrix of the character differences between the strings ‘3 Bears OG’ and ‘Three Bears og’.

The red 1 denotes that 1 operation is needed to turn the M into an empty string

Beautiful but verbose, so let’s make this even easier by just installing and importing the Levenshtein package.
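A short sketch of that call, assuming the python-Levenshtein package is installed (pip install python-Levenshtein). The pure-Python fallback and the alias edit_distance are additions here, included only so the snippet still runs when the package is missing:

```python
try:
    # Provided by the python-Levenshtein package.
    from Levenshtein import distance as edit_distance
except ImportError:
    # Hypothetical fallback: the same edit distance, computed row by row.
    def edit_distance(a, b):
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, start=1):
            current = [i]
            for j, char_b in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + (char_a != char_b), # match or substitution
                ))
            previous = current
        return previous[-1]


print(edit_distance("3 Bears OG", "Three Bears og"))  # 7
```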

The Levenshtein package’s distance function returns the raw number of edits required to match the strings rather than a normalized similarity score. To put that value to use, we apply FuzzyWuzzy to our strings or lists of strings to gather the closest terms. FuzzyWuzzy makes use of the Levenshtein distance through its ratio function, which scores the similarity of two strings from 0 to 100 based on their character differences.
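To get a feel for a 0 to 100 similarity score, here is a stand-in built on difflib from the standard library. It is similar in spirit to FuzzyWuzzy’s fuzz.ratio but is not the library’s own code, and the scores may differ slightly from what FuzzyWuzzy reports; the name similarity_ratio is hypothetical:

```python
from difflib import SequenceMatcher


def similarity_ratio(first, second):
    """Score two strings from 0 (nothing in common) to 100 (identical)."""
    matcher = SequenceMatcher(None, first, second)
    return int(round(100 * matcher.ratio()))


# Case matters: lowercasing both names first raises the score.
print(similarity_ratio("3 Bears OG", "Three Bears og"))
print(similarity_ratio("3 bears og", "three bears og"))
```

With FuzzyWuzzy installed, the corresponding call would be fuzz.ratio(first, second) after from fuzzywuzzy import fuzz.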

The above function takes in names from different DataFrames so they can be merged into one DataFrame based on similarity. For my particular use, I input the names of various strains of cannabis to reconcile the deviations in text across different websites. For instance, as seen above, the strain 3 Bears OG was present in all three of my DataFrames for the project I worked on, but was written as ‘Three Bears OG’ and ‘3 Bears Og’ in different databases, which gave errors in merging due to the capitalization and spelling differences.

The above code does not include the similarity score between the two strings, but this can easily be added by updating the dictionary with the score of the match, which would then appear as a new column in the DataFrame. Only the first match is displayed, as denoted by match[0].
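A sketch of that variant, keeping the score next to each match. It uses the standard library rather than FuzzyWuzzy itself, and the names match_names, similarity_ratio, and the threshold default of 60 are all assumptions made for illustration:

```python
from difflib import SequenceMatcher


def similarity_ratio(first, second):
    """Stand-in for a 0 to 100 similarity score (not FuzzyWuzzy's own code)."""
    return int(round(100 * SequenceMatcher(None, first, second).ratio()))


def match_names(names, choices, threshold=60):
    """For each name, keep its best-scoring choice if it clears the threshold."""
    rows = []
    for name in names:
        # Rank every candidate by score, highest first (ties keep list order).
        ranked = sorted(
            ((choice, similarity_ratio(name, choice)) for choice in choices),
            key=lambda pair: pair[1],
            reverse=True,
        )
        if not ranked:
            continue  # nothing to match against
        match = ranked[0]  # only the first (best) match is kept
        if match[1] >= threshold:
            rows.append({"name": name, "match": match[0], "score": match[1]})
    return rows


rows = match_names(
    ["Three Bears OG", "3 Bears Og"],
    ["3 Bears OG", "Blue Dream", "Blues Dream"],
)
for row in rows:
    print(row)
```

Passing the resulting list of dictionaries to pandas.DataFrame(rows) gives one row per name, with the score as its own column. In FuzzyWuzzy, process.extract(query, choices) returns comparable (choice, score) pairs.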

Sample of Matched Strings

Here, we can see that strains written in lowercase were matched with those in title case without any text pre-processing. Similarly, apostrophes are given weight in calculating whether Blue’s Dream is most similar to Blue Dream or Blues Dream.

FuzzyWuzzy can be used along with NLP to improve the use of categorical variables in various aspects of data science and analysis. In its simplest form, it can just be a fun way to see how much work is needed to get two strings to be the same.
