string replace using a List

Question

I have a List of words I want to ignore like this one :

public List<String> ignoreList = new List<String>()
        {
            "North",
            "South",
            "East",
            "West"
        };

For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.

I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.

The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

But it's not one line @htw. you don't get any geek points if it's not one line. — George Mauer, Sep 14 '10 at 19:56
Don't let this program run in Charlotte, NC. Prominent road names happen to be East Blvd, South Blvd, West Blvd. Those are the names of the roads, not a differentiation of *now you're on West 1st Street.* On that note, there are other scenarios where your directions aren't really directions, but key parts of the identifier. Northampton, Northlake (mall/area in Charlotte), North Carolina, North Dakota, etc. — Anthony Pegram, Sep 14 '10 at 19:57
@Anthony: This is true, I will be careful with what I put in my dictionary. However, I match on postal code (zip) first, which must match exactly for the function to even consider the addresses. From there, I'd rather get false positives than miss results. — Hugo Migneron, Sep 14 '10 at 20:06
Then you will be pleased to know that East, West, and South Blvds all intersect! They will share a zip! I'm convinced if you can get your program to run in Charlotte, you can get it to run anywhere. — Anthony Pegram, Sep 14 '10 at 20:13
@Anthony: That sounds like a nightmare. Luckily, my program only really needs to work for Canadian addresses. — Hugo Migneron, Sep 14 '10 at 20:17
And Canada is totally free of North/South streets/boulevards? I think Anthony's comment was a lot more generic than your problem statement. — H H, Sep 14 '10 at 20:37
I guess not, but this isn't really a problem for me. The program will only ever run for less than 10k people spread all over a province (and there will never be more). For the few people that share a postal code, I don't mind getting false positives. In my case, false positives are better than a result I miss. So in other words, if I remove too much and get a hit because of it, no big deal. — Hugo Migneron, Sep 15 '10 at 00:45

Gabe · Accepted Answer · 2010-09-14T20:00:06.893

14

How about this:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));

or, before .NET 4 (for example .NET 3.5), where string.Join needs an array:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());

Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.

edited Sep 14 '10 at 20:00

answered Sep 14 '10 at 19:54

Gabe


This is a great solution, both shorter and clearer than the regex versions. – AHM Sep 14 '10 at 20:00
You might as well split by the words - `text.Split(ignoreList.ToArray(), StringSplitOptions.None)`. That said, it is easier to adapt your approach to ignore case. – Kobi Sep 14 '10 at 20:05
What about punctuation before or after words? – Mark Byers Sep 14 '10 at 20:07
Kobi: `text.Split(ignoreList.ToArray())` doesn't work for the same reason all the `string.Replace` methods don't work. – Gabe Sep 14 '10 at 20:09
Mark: Presumably he would want to consider punctuation to be word-breakers. It's up to him, but I'd guess he'd want `text.Split(new[]{' ','.',',','-'})` but he can tweak it to support whatever algorithm he has. – Gabe Sep 14 '10 at 20:13
@Gabe: Then it won't match words containing punctuation, such as `St.`. – Mark Byers Sep 14 '10 at 20:50
Mark: I would expect that if he wants to ignore `St.` and he wants `.` to be a word-breaker, he would just put `St` in his ignore list. – Gabe Sep 14 '10 at 22:25
Thanks a lot, this is a great solution. Very clean and readable. – Hugo Migneron Sep 15 '10 at 00:51
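
The two tweaks discussed in these comments (ignoring case and treating punctuation as word-breakers) could be combined roughly as follows. This is only a sketch: the class name `AddressFilter`, the method name `RemoveIgnoredWords`, the exact separator set and the extra `St` entry are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AddressFilter
{
    // Hypothetical ignore list; the comparer makes lookups case-insensitive.
    private static readonly HashSet<string> IgnoreWords =
        new HashSet<string>(new[] { "North", "South", "East", "West", "St" },
                            StringComparer.OrdinalIgnoreCase);

    // Assumed word-breakers, following the suggestion in the comments above.
    private static readonly char[] Separators = { ' ', '.', ',', '-' };

    public static string RemoveIgnoredWords(string text)
    {
        // Split into whole words, drop empty pieces left by adjacent separators,
        // and keep only the words that are not in the ignore list.
        var kept = text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !IgnoreWords.Contains(w));

        // ToArray() keeps this compatible with the pre-.NET 4 string.Join overload.
        return string.Join(" ", kept.ToArray());
    }
}
```

With these assumptions, `AddressFilter.RemoveIgnoredWords("Northampton St. West")` keeps `Northampton` intact while dropping `St` and `West`. Note that the separators themselves are discarded, so a range such as `12-14` comes back as `12 14`.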

Bob · Answer 2 · 2010-09-14T20:12:11.970

6

Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);

edited Sep 14 '10 at 20:12

answered Sep 14 '10 at 19:50

Bob


If there are special characters, you should escape the stuff in ignoreList: string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()) – Frank Schwieterman Sep 14 '10 at 19:54
Since odds are the list will contain words like `"St."`, escaping is advised. And you have to look only for whole words. – Gabe Sep 14 '10 at 20:04
@Frank Correct . . . though it isn't really specified where the list comes from. It would probably be easiest to just write the correct regular expression in the first place rather than to convert it from a list, unless the list is really necessary. – Bob Sep 14 '10 at 20:15
Yeah, building a Regex dynamically is only really worthwhile if the list contents might change. Using a Regex in general is only useful if this function is used a lot, as it's potentially faster than N string replacements. – Frank Schwieterman Sep 14 '10 at 20:59

score 5 · Answer 3 · answered Sep 14 '10 at 19:47

5

Something like this should work:

string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
  return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}

answered Sep 14 '10 at 19:47

George Mauer


I might have swapped around the parameters to the second lambda, but this will definitely work. Aggregate is an incredibly powerful method; it's lame people don't use it very often – George Mauer Sep 14 '10 at 19:52
It should be noted that I doubt calling Replace multiple times is the most performant way of doing this. Probably something where you build the contents of the list into a static RegEx and use that to replace would be faster, but I suspect the difference won't matter in this case. – George Mauer Sep 14 '10 at 19:54
This is not correct because it uses `string.Replace` which can't match only on a word boundary. If you're going to use a RegEx, though, it should use a single compiled one. – Gabe Sep 14 '10 at 20:06
Good point @Gabe the example is more about the usage of Aggregate than of Replace. – George Mauer Sep 14 '10 at 20:10
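
The "single compiled RegEx" idea raised in these comments might look something like the sketch below. The class name `IgnoreWordRegex` and the method name `Strip` are made up for illustration, and the `\b` word boundaries assume the ignore list contains only plain words (entries ending in punctuation, such as `St.`, would need a different boundary check).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IgnoreWordRegex
{
    // Assumed ignore list, copied from the question.
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West" };

    // Built once and reused for every call. Each word is escaped, and the
    // alternation is wrapped in \b so only whole words match.
    private static readonly Regex Pattern = new Regex(
        @"\b(?:" + string.Join("|", IgnoreList.Select(w => Regex.Escape(w)).ToArray()) + @")\b",
        RegexOptions.Compiled);

    public static string Strip(string text)
    {
        // One pass over the input instead of one Replace call per ignored word.
        return Pattern.Replace(text, string.Empty);
    }
}
```

Unlike the `string.Replace` version, `IgnoreWordRegex.Strip("123 Northampton Way")` leaves the input unchanged, while `Strip("14th Avenue North")` returns `"14th Avenue "` with the trailing space still in place.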

score 3 · Answer 4 · answered Sep 14 '10 at 19:48

3

What's wrong with a simple for loop?

string street = "14th Avenue North";
foreach (string word in ignoreList)
{
    street = street.Replace(word, string.Empty);
}

answered Sep 14 '10 at 19:48

Albin Sunnanbo


Mark Byers · Answer 5 · 2010-09-14T20:59:39.140

2

If you know that the list of words contains only characters that do not need escaping inside a regular expression then you can do this:

string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
                        string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");

Result:

14th Avenue

If there are special characters you will need to fix two things:

Use Regex.Escape on each element of ignore list.
The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.

Here's how to fix these two problems:

Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
    string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
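
For illustration, here is one way the lookaround pattern above might be wrapped into a helper and applied to a list that contains punctuation. The class name `LookaroundFilter`, the method name `Remove` and the final whitespace clean-up step are additions for this sketch, not part of the answer itself; it also assumes the ignore list is not empty.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundFilter
{
    public static string Remove(string text, IEnumerable<string> ignoreList)
    {
        // Escape every entry so that "St." matches a literal dot.
        string alternatives =
            string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray());

        // A match must be preceded by a space or the start of the string,
        // and followed by a space or the end of the string.
        var regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)", alternatives));
        string stripped = regex.Replace(text, "");

        // Assumed clean-up: collapse the spaces left behind by removed words.
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

Called as `LookaroundFilter.Remove("14th St. North", new[] { "North", "St." })`, this should give `"14th"`, and an address such as `"123 Northampton Way"` passes through untouched.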

edited Sep 14 '10 at 20:59

answered Sep 14 '10 at 19:55

Mark Byers


It's a pretty good bet that his words *will* need escaping, because they'll be like `"St.", "Blvd.", "Rd."` – Gabe Sep 14 '10 at 20:03
That's a great way to handle the space problem raised in another comment. – Hugo Migneron Sep 14 '10 at 20:03
This is very clever and it seems like it would work on all the words. I will write some tests for it and try it out properly. – Hugo Migneron Sep 14 '10 at 20:15

score 1 · Answer 6 · edited May 23 '17 at 12:01

1

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:

address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));

If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.

edited May 23 '17 at 12:01

Community


answered Sep 14 '10 at 19:53

Guffa


Thanks a lot for that. My ignore list will obviously be much longer than what I posted here, but not sure if it will be long enough to use your method. I will profile it and see though. – Hugo Migneron Sep 14 '10 at 19:57

score 1 · Answer 7 · answered Sep 14 '10 at 21:30

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.

List<string> ignoreList = new List<string>()
{
    "North",
    "South",
    "East",
    "West"
};    

string s = "123 West 5th St"
        .Split(' ')  // Separate the words to an array
        .ToList()    // Convert array to List<string>
        .Except(ignoreList) // Remove ignored keywords (Except is a set operation, so repeated words are also collapsed)
        .Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string

The `.ToList()` is unnecessary. – Gabe Sep 14 '10 at 22:28

score 0 · Answer 8 · edited Sep 14 '10 at 22:27

0

public static string Trim(string text)
{
   var rv = text;
   foreach (var ignore in ignoreList)
      rv = rv.Replace(ignore, "");
   return rv;
}

Updated For Gabe

public static string Trim(string text)
{
   var rv = "";
   var words = text.Split(' ');
   foreach (var word in words)
   {
      var present = false;
      foreach (var ignore in ignoreList)
         if (word == ignore)
            present = true;
      if (!present)
         rv += word; // note: no separator is added back between the kept words
   }
   return rv;
}

edited Sep 14 '10 at 22:27

Gabe


answered Sep 14 '10 at 19:47

Umair A.


No LINQ, not RegExp, yet it's correct. Only thing I'd change is the use of an empty string literal. – Steven Sudit Sep 14 '10 at 19:49
No, not correct. This will turn "123 Northampton" into "123 ampton". – Gabe Sep 14 '10 at 19:50
Close...now you need to make sure that you put back the space between words. – Gabe Sep 14 '10 at 22:29

Brad · Answer 9 · 2010-09-15T16:28:13.380

0

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.

Here's a start:

(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)

If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.
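
As a rough sketch of how the pattern above could be applied with a single replace per phrase: the class name `DirectionFilter`, the method name `Strip`, and the choice to substitute a single space (rather than String.Empty) followed by a trim are assumptions made here so that neighbouring words stay separated.

```csharp
using System.Text.RegularExpressions;

public static class DirectionFilter
{
    // The pattern from the answer, compiled once and reused.
    private static readonly Regex Directions = new Regex(
        @"(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)",
        RegexOptions.Compiled);

    public static string Strip(string phrase)
    {
        // The match swallows the whitespace on both sides of the word, so a
        // single space is put back before trimming the ends.
        return Directions.Replace(phrase, " ").Trim();
    }
}
```

One limitation worth checking in your own tests: because each match consumes the whitespace around it, two ignored words separated by a single space (for example `North West Road`) are not both removed in one pass; the second word has no leading whitespace left to match against.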

edited Sep 15 '10 at 16:28

answered Sep 14 '10 at 19:48

Brad


I guess we could. Do we really want to, though? – Steven Sudit Sep 14 '10 at 19:50
This is a good start. Now make it so that it only matches whole words. – Gabe Sep 14 '10 at 19:52
We used this approach to flag a huge list of customers as business or residential based on RegEx keywords generated from looking at the data. – Brad Sep 14 '10 at 20:15

score 0 · Answer 10 · answered Sep 14 '10 at 19:52

0

Why not just Keep It Simple?

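public static string Trim(string text)
{
   var rv = text.Trim();
   foreach (var ignore in ignoreList)
   {
      // Only act when the string ends with an ignored word.
      // Note: Replace removes every occurrence of the word, not just the trailing one.
      if (rv.EndsWith(ignore))
      {
         rv = rv.Replace(ignore, string.Empty);
      }
   }
   return rv;
}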

answered Sep 14 '10 at 19:52

Damian Leszczyński - Vash


score 0 · Answer 11 · answered Sep 14 '10 at 19:58

0

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:

string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "

answered Sep 14 '10 at 19:58

Øyvind Bråthen

