I need to link certain sample names to each other. The problem is that the sample names I want to match differ slightly from the keys of a dictionary from which I need to get the correct value.

Example:

sample = "foo_foo.bar.12"
matching_dict = {"foo_foo-bar-12" : "foo.bar.12"}

I have about 5500 samples, each with a different naming arrangement, so not every sample looks like the example above. Ideally I want a dynamic way of comparing the sample string with each dictionary key and then getting the value for the key that is most similar.

Thanks in advance!

BenB

2 Answers

You could use Levenshtein distance, which measures how similar two strings are as the number of single-character edits needed to turn one into the other. There is an easy-to-use Python library for it called python-levenshtein. With it you could compare your sample to every key in the dictionary and pick the key with the lowest Levenshtein distance.
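The lookup described above could be sketched as follows. To keep the example dependency-free it uses a small pure-Python edit-distance function rather than python-levenshtein; the helper names `levenshtein` and `closest_value` and the second dictionary entry are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_value(sample: str, mapping: dict) -> str:
    """Return the value whose key has the smallest edit distance to sample."""
    best_key = min(mapping, key=lambda key: levenshtein(sample, key))
    return mapping[best_key]


# The second entry is a hypothetical extra key, added only to show the selection.
matching_dict = {
    "foo_foo-bar-12": "foo.bar.12",
    "baz_baz-qux-7": "baz.qux.7",
}

print(closest_value("foo_foo.bar.12", matching_dict))
```

With 5500 samples it may be worth checking the winning distance against a threshold as well, since `min` always returns some key even when nothing is really similar.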

IsolatedSushi

As Peter Wood suggested, you can try fuzzywuzzy. It uses Levenshtein distance to calculate the differences between sequences, in a simple-to-use package: https://github.com/seatgeek/fuzzywuzzy

pip install fuzzywuzzy

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
97
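To apply a ratio like this to the dictionary from the question, one option is to score every key and keep the best one above a cutoff. The sketch below is an assumption-laden stand-in: it uses the standard library's `difflib.SequenceMatcher`, which computes a comparable 0 to 1 similarity ratio, instead of `fuzz.ratio`, so it runs without installing anything. The function name `best_match`, the `cutoff` default and the second dictionary entry are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(sample, mapping, cutoff=0.6):
    """Return the value of the key most similar to sample,
    or None if no key reaches the cutoff ratio."""
    best_key, best_score = None, 0.0
    for key in mapping:
        score = SequenceMatcher(None, sample, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_key is None or best_score < cutoff:
        return None
    return mapping[best_key]


# The second entry is a hypothetical extra key for illustration.
matching_dict = {
    "foo_foo-bar-12": "foo.bar.12",
    "baz_baz-qux-7": "baz.qux.7",
}

print(best_match("foo_foo.bar.12", matching_dict))
```

Returning `None` below the cutoff makes unmatched samples easy to collect and inspect by hand, which is probably safer than silently accepting a weak match.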
Rahul Raut