3

What would be an efficient way to remove similar strings from a list?

Consider a List<string> consisting of these (and other) strings:

"SRS INVESTMENT MANAGEMENT, LLC"

"SRS INVESTMENT MANAGEMENT"

"Maplelane Capital, Ltd."

"Maplelane Capital, Limited"

So what I need to do is remove strings that are 'similar enough'. My idea is that this should be done by capitalising all the strings of the list, and then remove any string that matches all except the last X characters of another string. In the end I want this to leave me with a list containing only one string for each real-life company that they actually represent.

Any ideas on how I can achieve this?

Draken
  • 3,134
  • 13
  • 34
  • 54
Agent Shoulder
  • 586
  • 8
  • 22
  • 3
    You should use Levenshtein distance for strings to measure how close they are and remove any strings under a certain threshold. You might want to do some fancy work if you want to end up with the strings that result in a global minimum of some sort. This doesn't sound all that trivial. – Jashaszun Jun 14 '16 at 16:35
  • How many items in the list (approximately)? – spender Jun 14 '16 at 16:36
  • 2
    Are you guaranteed that all the names in the list have representations that are drawn from the standard American English 26 letter alphabet? For example, if you need to make "Zürich Financial Services" match with "Zurich Financial Services, Ltd", then you have a harder problem on your hands. Those strings have a mismatch in the second character. – Eric Lippert Jun 14 '16 at 16:39
  • It is around 10-20 strings in the list @spender – Agent Shoulder Jun 14 '16 at 16:46
  • @EricLippert It could potentially contain any character, also including apostrophes, commas, etc.. – Agent Shoulder Jun 14 '16 at 16:51
  • Step 1. Find government-controlled APIs for looking up registered business names. Step 2. Query each entry against the API to get a canonical name or ID. – Brian Jun 14 '16 at 18:52

2 Answers2

2

You could start by creating a routine to replace obvious abbreviations with the full word, then remove white space. The good news is that Companies house has very strict rules about company names. For example, you cannot have a company called 'B & C Ltd', if 'Band C Ltd' already exists. After that you will need to start thinking about matching algorithms, such as Levenshteins and Soundex.

JonnyCab
  • 104
  • 7
  • I experimented replacing abbreviations and such, and it does seem to work, although the list of things to replace does get extensive.. I figure this is a good starting point, before using the solution proposed by @CharlesNRice – Agent Shoulder Jun 16 '16 at 01:52
  • If it is company names you're trying to match, you need to be very careful with the algorithms. We spent years developing our business matching software and every client has a different set of requirements. My best advise would be to keep eyeballing the results until you find an approach that gets the right balance for the individual project. – JonnyCab Jun 17 '16 at 11:35
1

I would suggest you create an IEqualityComparer to encapsulate the logic to determine if two strings are equal.

An example if you wanted to mix and match SoundEx and Levenshtein might be something like

public class CompanyNameComparer : IEqualityComparer<string>
{

    public bool Equals(string x, string y)
    {
        if (x == null && y == null)
        {
            return true;
        }
        if (x == null || y == null)
        {
            return false;
        }

        var src1 = FormatString(x);
        var src2 = FormatString(y);

        if (src1 == src2)
        {
            return true;
        }

        var difference = CalcLevenshteinDistance(src1, src2);

        // arbitrary number you will need to find what works
        return difference < 7;
    }

    private string FormatString(string source)
    {
        return source.Trim().ToUpper();
    }

    // code taken from http://stackoverflow.com/a/9453762/1798889
    private int CalcLevenshteinDistance(string a, string b)
    {
       // code not included 
    }

    public int GetHashCode(string obj)
    {
        return Soundex(obj).GetHashCode();
    }

    private string Soundex(string data)
    {
        // code not included 
    }
}

I didn't include all the code because that's not the main point. Only you will know if SoundEx and Levenshtein will work or if it needs to be something else. But if you put that decision making in it's own class if it needs to be tweaked it's just one place that needs to be changed.

Then you can get a unique list either with Linq or a HashSet. Assuming data is the name of your variable of a List

var uniqueEnumerable = data.Distinct(new CompanyNameComparer());
var uniqueSet = new HashSet<string>(data, new CompanyNameComparer());
CharlesNRice
  • 3,219
  • 1
  • 16
  • 25