
I want to look for permutations that match a given word, and arrange my data based on column position.

For example, I created a CSV with data I scraped from several websites. Say it looks something like this:

Name1        OtherVars    Name2        More Vars

Stanford     23451        Mamford      No
MIT          yes          stanfor1d    12
BeachBoys    pie          Beatles      Sweeden

I want to (1) find permutations of each word from Name1 in Name2, and then (2) print a table with that word from Name1, its matching value in OtherVars, the permutation of that word found in Name2, and its matching value in More Vars. If no match is found, the word is simply dropped.

In this case, the outcome would be:

Name1        OtherVars    Name2        More Vars

Stanford     23451        stanford     12

So, how do I:

  1. Find matching permutations of a word in another column?

  2. Print the two words and the values they are mapped to in the other columns?

PS: Here's a similar question, but it is about Java and only has pseudocode: "How to find all permutations of a given word in a given text?". difflib does not seem suitable for CSVs, based on "How to find the most similar word in a list in python".

PS2: I was advised to use Fuzzymatch; however, I suspect that it's overkill in this case.

oba2311

2 Answers

0

If you're looking for a function that returns the same output for "Stanford" and "stanf1ord", you could:

  • use lowercase
  • only keep letters
  • sort the letters


import re

def signature(word):
    return sorted(re.findall('[a-z]', word.lower()))

print(signature("Stanford"))
# ['a', 'd', 'f', 'n', 'o', 'r', 's', 't']
print(signature("Stanford") == signature("stanfo1rd"))
# True

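You could create a set or dict of signatures from the first column and check whether any signature from the second column matches. Lists are not hashable, so convert each signature to a tuple or a string before using it as a set member or dict key.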
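
A minimal sketch of that lookup, assuming the CSV has already been read into a list of `(name1, other_vars, name2, more_vars)` tuples. The `match_rows` helper and the row layout are illustrative assumptions, not part of the original answer; `signature` is changed here to return a string so it can serve as a dict key.

```python
import re
from collections import defaultdict

def signature(word):
    # Lowercase, keep letters only, sort; join into a string so it is hashable.
    return "".join(sorted(re.findall("[a-z]", word.lower())))

def match_rows(rows):
    """rows: list of (name1, other_vars, name2, more_vars) tuples (assumed layout)."""
    # Index the Name2 column by signature.
    by_signature = defaultdict(list)
    for _, _, name2, more_vars in rows:
        by_signature[signature(name2)].append((name2, more_vars))

    # Keep only Name1 entries whose signature also occurs in Name2.
    matches = []
    for name1, other_vars, _, _ in rows:
        for name2, more_vars in by_signature.get(signature(name1), []):
            matches.append((name1, other_vars, name2, more_vars))
    return matches

rows = [
    ("Stanford", "23451", "Mamford", "No"),
    ("MIT", "yes", "stanfor1d", "12"),
    ("BeachBoys", "pie", "Beatles", "Sweeden"),
]
print(match_rows(rows))
# [('Stanford', '23451', 'stanfor1d', '12')]
```

Indexing the second column once keeps the lookup at one dict access per Name1 entry instead of comparing every pair of names.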

Eric Duminil
  • Thank you, but I believe that this normalization you are suggesting is only the first step out of many in this problem. How to search for a good match that is close to the word, AFTER normalizing like you are suggesting. Are you suggesting to normalize all the data in the search space? – oba2311 Mar 21 '17 at 01:41
  • @oba2311 You only mentioned permutations in your question. Permutations are covered by my code. If you need more fuzzy logic, you'll need to define exactly what kind of and show that you tried something. – Eric Duminil Mar 21 '17 at 08:55
0

You seem to want fuzzy matching, not "permutations". There are a few Python fuzzy matching libraries, but I think people like fuzzywuzzy.

Alternatively, you can roll your own. Something like:

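def ismatch(s1, s2):
    # Placeholder rule: replace with your own matching logic.
    # Here two names match if they are equal ignoring case.
    return s1.lower() == s2.lower()

def group(rows):
    # rows: list of (name1, var1, name2, var2) tuples, one per CSV line.
    # Pair every Name1 entry with every Name2 entry that matches it.
    pairs = [(n1, v1, n2, v2)
             for n1, v1, _, _ in rows
             for _, _, n2, v2 in rows
             if ismatch(n1, n2)]
    return pairs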
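
One possible way to fill in `ismatch` without a third-party library is the standard library's `difflib.SequenceMatcher`, which scores the similarity of two strings between 0 and 1. This is only a sketch: the 0.8 threshold and the `(name1, var1, name2, var2)` row layout are assumptions chosen for the example data in the question, and a different cutoff may suit other data.

```python
from difflib import SequenceMatcher

def ismatch(s1, s2, threshold=0.8):
    # Similarity ratio in [0, 1]; the 0.8 cutoff is an arbitrary assumption.
    ratio = SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
    return ratio >= threshold

def group(rows, threshold=0.8):
    # rows: list of (name1, var1, name2, var2) tuples (assumed layout).
    # Compare every Name1 entry against every Name2 entry.
    return [(n1, v1, n2, v2)
            for n1, v1, _, _ in rows
            for _, _, n2, v2 in rows
            if ismatch(n1, n2, threshold)]

rows = [
    ("Stanford", "23451", "Mamford", "No"),
    ("MIT", "yes", "stanfor1d", "12"),
    ("BeachBoys", "pie", "Beatles", "Sweeden"),
]
print(group(rows))
# [('Stanford', '23451', 'stanfor1d', '12')]
```

The rows are plain tuples, so it does not matter how they were read from the CSV. Note that this compares all pairs, which is fine for small files but grows quadratically with the number of rows.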
marisbest2