
I have a list of words I want to ignore, like this one:

public List<String> ignoreList = new List<String>()
        {
            "North",
            "South",
            "East",
            "West"
        };

For a given string, say "14th Avenue North", I want to be able to remove the "North" part. Basically, I want a function that would return "14th Avenue " when called.

I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.

The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.
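
For context, the comparison step could look roughly like the sketch below. It is a plain dynamic-programming edit distance; the class and method names (`Similarity`, `Levenshtein`) are made up for illustration and this is not necessarily the implementation I will end up using.

```csharp
using System;

public static class Similarity
{
    // Classic edit distance: insertions, deletions and substitutions each cost 1.
    // Only two rows of the DP table are kept in memory at a time.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }

            // The row just computed becomes the "previous" row for the next character.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }
}
```

The idea is to run both addresses through the ignore-word filter first and only then call something like `Similarity.Levenshtein(filteredA, filteredB)`, so that words such as "Street" or "North" don't inflate the distance.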

John Saunders
Hugo Migneron
  • But it's not one line @htw. You don't get any geek points if it's not one line. – George Mauer Sep 14 '10 at 19:56
  • Don't let this program run in Charlotte, NC. Prominent road names happen to be East Blvd, South Blvd, West Blvd. Those are the names of the roads, not a differentiation of *now you're on West 1st Street.* On that note, there are other scenarios where your directions aren't really directions, but key parts of the identifier. Northampton, Northlake (mall/area in Charlotte), North Carolina, North Dakota, etc. – Anthony Pegram Sep 14 '10 at 19:57
  • @Anthony: This is true; I will be careful with what I put in my dictionary. However, I match on postal code (zip) first, which must match exactly for the function to even consider the addresses. From there, I'd rather get false positives than miss results. – Hugo Migneron Sep 14 '10 at 20:06
  • Then you will be pleased to know that East, West, and South Blvds all intersect! They will share a zip! I'm convinced if you can get your program to run in Charlotte, you can get it to run anywhere. – Anthony Pegram Sep 14 '10 at 20:13
  • @Anthony: That sounds like a nightmare. Luckily, my program only really needs to work for Canadian addresses. – Hugo Migneron Sep 14 '10 at 20:17
  • And Canada is totally free of North/South streets/boulevards? I think Anthony's comment was a lot more generic than your problem statement. – H H Sep 14 '10 at 20:37
  • I guess not, but this isn't really a problem for me. The program will only ever run for fewer than 10k people spread all over a province (and there will never be more). For the few people that share a postal code, I don't mind getting false positives. In my case, false positives are better than a result I miss. In other words, if I remove too much and get a hit because of it, no big deal. – Hugo Migneron Sep 15 '10 at 00:45

11 Answers

14

How about this:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));

or, for .NET 3.5 (where string.Join only accepts an array):

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());

Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.
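
If case and punctuation matter (both come up in the comments below), the same split-and-filter idea can be extended along these lines. This is only a sketch: the `AddressFilter` class, the extra "St" entry and the particular separator characters are assumptions for illustration, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AddressFilter
{
    // Hypothetical ignore list; OrdinalIgnoreCase makes "north" match "North".
    private static readonly HashSet<string> IgnoreWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "North", "South", "East", "West", "St"
        };

    // Characters treated as word breakers (an assumption; adjust to taste).
    private static readonly char[] Separators = { ' ', '.', ',', '-' };

    public static string RemoveIgnoredWords(string text)
    {
        // Split into whole words, drop the ignored ones, and rejoin with single spaces.
        var kept = text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !IgnoreWords.Contains(word));
        return string.Join(" ", kept);
    }
}
```

Calling `AddressFilter.RemoveIgnoredWords("123 west 5th St.")` should give "123 5th", while "Northampton Way #123" comes back unchanged. The trade-off is that the separator characters themselves are dropped from the result, which is probably fine if the output only feeds a similarity comparison.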

Gabe
  • This is a great solution, both shorter and clearer than the regex versions. – AHM Sep 14 '10 at 20:00
  • You might as well split by the words - `text.Split(ignoreList.ToArray(), StringSplitOptions.None)`. That said, it is easier to adapt your approach to ignore case. – Kobi Sep 14 '10 at 20:05
  • What about punctuation before or after words? – Mark Byers Sep 14 '10 at 20:07
  • Kobi: `text.Split(ignoreList.ToArray())` doesn't work for the same reason all the `string.Replace` methods don't work. – Gabe Sep 14 '10 at 20:09
  • Mark: Presumably he would want to consider punctuation to be word-breakers. It's up to him, but I'd guess he'd want `text.Split(new[]{' ','.',',','-'})` but he can tweak it to support whatever algorithm he has. – Gabe Sep 14 '10 at 20:13
  • @Gabe: Then it won't match words containing punctuation, such as `St.`. – Mark Byers Sep 14 '10 at 20:50
  • Mark: I would expect that if he wants to ignore `St.` and he wants `.` to be a word-breaker, he would just put `St` in his ignore list. – Gabe Sep 14 '10 at 22:25
  • Thanks a lot, this is a great solution. Very clean and readable. – Hugo Migneron Sep 15 '10 at 00:51
6
Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);
Bob
  • If there are special characters, you should escape the stuff in ignoreList: string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()) – Frank Schwieterman Sep 14 '10 at 19:54
  • Since odds are the list will contain words like `"St."`, escaping is advised. And you have to look only for whole words. – Gabe Sep 14 '10 at 20:04
  • @Frank Correct . . . though it isn't really specified where the list comes from. It would probably be easiest to just write the correct regular expression in the first place rather than to convert it from a list, unless the list is really necessary. – Bob Sep 14 '10 at 20:15
  • Yeah, building a Regex dynamically is only really worthwhile if the list contents might change. Using a Regex in general is only useful if this function is used a lot, as it's potentially faster than N string replacements. – Frank Schwieterman Sep 14 '10 at 20:59
5

Something like this should work:

string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
  return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}
George Mauer
  • I might have swapped around the parameters to the second lambda, but this will definitely work. Aggregate is an incredibly powerful method; it's lame people don't use it very often. – George Mauer Sep 14 '10 at 19:52
  • It should be noted that I doubt calling Replace multiple times is the most performant way of doing this. Probably something where you build the contents of the list into a static RegEx and use that to replace would be faster, but I suspect the difference won't matter in this case. – George Mauer Sep 14 '10 at 19:54
  • This is not correct because it uses `string.Replace` which can't match only on a word boundary. If you're going to use a RegEx, though, it should use a single compiled one. – Gabe Sep 14 '10 at 20:06
  • Good point @Gabe the example is more about the usage of Aggregate than of Replace. – George Mauer Sep 14 '10 at 20:10
3

What's wrong with a simple for loop?

string street = "14th Avenue North";
foreach (string word in ignoreList)
{
    street = street.Replace(word, string.Empty);
}
Albin Sunnanbo
2

If you know that the list of words contains only characters that do not need escaping inside a regular expression, then you can do this:

string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
                        string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");

Result:

14th Avenue 

If there are special characters you will need to fix two things:

  • Use Regex.Escape on each element of ignore list.
  • The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.

Here's how to fix these two problems:

Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
    string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
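
Put together, usage might look something like the sketch below. The `RegexAddressFilter` class, the extra "St." entry, the `RegexOptions.IgnoreCase` flag and the whitespace clean-up are additions for illustration and are not required by the pattern itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexAddressFilter
{
    // Hypothetical ignore list that includes an entry with punctuation.
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West", "St." };

    // Built once and reused; each entry is escaped so "." is matched literally.
    private static readonly Regex IgnoreRegex = new Regex(
        string.Format(@"(?<= |^)({0})(?= |$)",
            string.Join("|", IgnoreList.Select(x => Regex.Escape(x)).ToArray())),
        RegexOptions.IgnoreCase);

    public static string RemoveIgnoredWords(string text)
    {
        string stripped = IgnoreRegex.Replace(text, "");
        // Removing a word leaves its surrounding spaces behind; collapse and trim them.
        return Regex.Replace(stripped, " {2,}", " ").Trim();
    }
}
```

With this, "123 West 5th St." should come out as "123 5th", and "Northampton Way" is left alone because "North" is not followed by a space or the end of the string.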
Mark Byers
1

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy, you can use the LINQ Aggregate method to do it:

address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));

If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.

Community
Guffa
  • Thanks a lot for that. My ignore list will obviously be much longer than what I posted here, but not sure if it will be long enough to use your method. I will profile it and see though. – Hugo Migneron Sep 14 '10 at 19:57
1

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.

List<string> ignoreList = new List<string>()
{
    "North",
    "South",
    "East",
    "West"
};    

string s = "123 West 5th St"
        .Split(' ')  // Separate the words into an array
        .ToList()    // Convert the array to a List<>
        .Except(ignoreList) // Remove ignored keywords (note: Except also drops duplicate words)
        .Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string
Phil Gilmore
0
public static string Trim(string text)
{
   var rv = text;
   foreach (var ignore in ignoreList)
      rv = rv.Replace(ignore, "");
   return rv;
}

Updated for Gabe:

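public static string Trim(string text)
{
   var rv = "";
   var words = text.Split(' ');
   foreach (var word in words)
   {
      // Check whether this whole word is in the ignore list.
      var present = false;
      foreach (var ignore in ignoreList)
         if (word == ignore)
            present = true;
      // Keep the word, with a space between kept words so they don't run together.
      if (!present)
         rv += (rv.Length > 0 ? " " : "") + word;
   }
   return rv;
}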
Gabe
Umair A.
0

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.

Here's a start:

(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)

If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.
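
As a rough sketch of how that single replace might be wired up (the `DirectionStripper` class and `Strip` method are names invented here): since the pattern also consumes the whitespace on both sides of the word, replacing with a single space and trimming afterwards, rather than replacing with String.Empty, keeps the neighbouring words from being glued together.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DirectionStripper
{
    // The pattern from this answer, constructed once and reused for every phrase.
    private static readonly Regex Directions = new Regex(
        @"(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)");

    public static string Strip(string phrase)
    {
        // Replace with a single space (not String.Empty) so neighbours stay separated,
        // then trim the space left at either end.
        return Directions.Replace(phrase, " ").Trim();
    }
}
```

One caveat, as far as I can tell: because each match swallows the whitespace around it, two ignored words separated by a single space (e.g. "North West") would only lose the first one in a single pass, so the pattern may need tweaking for that case.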

Brad
0

Why not just keep it simple?

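public static string Trim(string text)
{
   var rv = text.Trim();
   foreach (var ignore in ignoreList)
   {
      // Only act when the address ends with an ignored word.
      // Note that Replace then removes every occurrence of it, not just the trailing one.
      if (rv.EndsWith(ignore))
      {
         rv = rv.Replace(ignore, string.Empty);
      }
   }
   return rv;
}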
0

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:

string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "
Øyvind Bråthen