I have a problem that arises from trying to pair arrays based on strings. Basically, I have a list of baseball players with their relevant stats. I've found that various sites use different spellings for the same players, e.g. "Steve" vs. "Stephen". Obviously this throws pure search functions for a loop.

With that said, I am considering using the Levenshtein Python extension and C library. However, I am not sure how to implement this efficiently. In theory I could loop through the entire list for each name in the base list, but that is a last resort. Is there a better way to do this?
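For reference, the brute-force loop I have in mind would look roughly like the minimal sketch below. The pure-Python `levenshtein` function is only a stand-in so the snippet runs without the C extension. The names `best_matches` and `max_distance`, the threshold of 3, and the sample player names are all made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming (pure-Python stand-in)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_matches(base_names, other_names, max_distance=3):
    """Compare every base name against every other name: O(n * m) distance calls.

    max_distance is a hypothetical cut-off; anything farther away is
    reported as unmatched instead of being paired with a poor candidate.
    """
    matched, unmatched = {}, []
    for name in base_names:
        # Score each candidate once, then keep the closest one.
        scored = [(levenshtein(name.lower(), other.lower()), other)
                  for other in other_names]
        distance, candidate = min(scored)
        if distance <= max_distance:
            matched[name] = candidate
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical sample data, not real stats.
matched, unmatched = best_matches(
    ["Steve Smith", "Zed Zzyzx"],
    ["Stephen Smith", "Mike Jones"],
)
```

Every base name triggers a full pass over the other list, which is the part I would like to avoid if there is a smarter approach.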

Alex Ketay
  • How many players are there? Any strong priors on the types of misspellings committed? E.g. one could potentially assume that the first letter of the name will be correct (if it's a completely different nickname, then Levenshtein won't find it for you either). In that case you can cut the set into 26 buckets and work within them. Other priors would lead to other segmentations. Without priors, some reasoning may still be possible based on the triangle inequality. – eickenberg Apr 30 '14 at 16:54
  • @eickenberg The list isn't too long. Unfortunately, I don't know about priors. Perhaps I should just create a "did not match" list to catch the outliers. – Alex Ketay Apr 30 '14 at 18:04
  • "Isn't too long" meaning you can brute force it? Then brute force it - you only need to do it once, right? – eickenberg Apr 30 '14 at 19:53
  • Sure, it's just that I think it's always better to look for the more efficient and elegant way of doing things. In the long run it tends to pay off as your projects become more and more complex. – Alex Ketay Apr 30 '14 at 22:14
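
The first-letter segmentation suggested in the comments, combined with a "did not match" list, could be sketched as follows. This is only an illustration under stated assumptions: it uses `difflib.get_close_matches` from the standard library as a stand-in for a Levenshtein-based score, and the function name `match_by_first_letter`, the `cutoff` value, and the sample names are hypothetical.

```python
from collections import defaultdict
from difflib import get_close_matches


def match_by_first_letter(base_names, other_names, cutoff=0.75):
    """Only compare names that share a first letter (assumed to be spelled correctly)."""
    # Bucket the candidate names by lower-cased first letter,
    # remembering the original spelling for each normalized key.
    buckets = defaultdict(dict)
    for other in other_names:
        key = other.strip().lower()
        if key:
            buckets[key[0]][key] = other

    matched, unmatched = {}, []
    for name in base_names:
        key = name.strip().lower()
        candidates = buckets.get(key[:1], {})
        # difflib similarity ratio stands in for a Levenshtein score here.
        close = get_close_matches(key, list(candidates), n=1, cutoff=cutoff)
        if close:
            matched[name] = candidates[close[0]]
        else:
            unmatched.append(name)  # the "did not match" list
    return matched, unmatched


# Hypothetical sample data.
matched, unmatched = match_by_first_letter(
    ["Steve Smith", "Zed Zzyzx"],
    ["Stephen Smith", "Sam Stone", "Mike Jones"],
)
```

Each name is compared only against its own bucket, so the number of comparisons drops roughly in proportion to the number of buckets. The trade-off is the one already noted in the comments: a variant that changes the first letter lands in a different bucket and ends up in the unmatched list.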
