7

I have a string comparison issue that - for the most part - behaves as expected, but is leaving me with a large number f duplicate DB insertions because my code is not detecting the string pairs as duplicate.

I thought I had narrowed it down to a culture issue (Cyrillic characters), which I resolved, but I'm now getting 'false negatives' (two apparently equal strings showing up as not-equal).

I've looked at the following similar questions and tried the following comparison approaches.

Similar SO questions that I've checked:

Here's an example of the strings being compared: (title and description)

feed title: Ellsberg: He's a hero

feed desc: Daniel Ellsberg tells CNN's Don Lemon that NSA leaker Edward Snowden showed courage, has done an enormous service.

db title: Ellsberg: He's a hero

db desc: Daniel Ellsberg tells CNN's Don Lemon that NSA leaker Edward Snowden showed courage, has done an enormous service.

My app compares values fetched from RSS feeds with values I have in the DB and should only insert "new" values.

//fetch existing articles from DB for the current feed:
    List<Article> thisFeedArticles = (from ar in entities.Items
                                      where (ar.ItemTypeId == (int)Enums.ItemType.Article) && ar.ParentId == feed.FeedId
                                      && ar.DatePublished > datelimit
                                      select new Article
                                      {
                                           Title = ar.Title, 
                                           Description = ar.Blurb
                                      }).ToList();

Everyone of the below comparison show no match for the Ellsberg title/description. i.e. matches1 to matches6 all have Count()==0

(please excuse the enumerated variable names - they are just for testing)

   // comparison methods 
CompareOptions compareOptions = CompareOptions.OrdinalIgnoreCase;
CompareOptions compareOptions2 = CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace;
//1
IEnumerable<Article> matches = thisFeedArticles.Where(b =>
    String.Compare(b.Title.Trim().Normalize(), a.Title.Trim().Normalize(), CultureInfo.InvariantCulture, compareOptions) == 0 &&
    String.Compare(b.Description.Trim().Normalize(), a.Description.Trim().Normalize(), CultureInfo.InvariantCulture, compareOptions) == 0
    );

//2
IEnumerable<Article> matches2 = thisFeedArticles.Where(b =>
    String.Compare(b.Title, a.Title, CultureInfo.CurrentCulture, compareOptions2) == 0 &&
    String.Compare(b.Description, a.Description, CultureInfo.CurrentCulture, compareOptions2) == 0
    );

//3
IEnumerable<Article> matches3 = thisFeedArticles.Where(b =>
    String.Compare(b.Title, a.Title, StringComparison.OrdinalIgnoreCase) == 0 &&
    String.Compare(b.Description, a.Description, StringComparison.OrdinalIgnoreCase) == 0
    );

//4
IEnumerable<Article> matches4 = thisFeedArticles.Where(b =>
    b.Title.Equals(a.Title, StringComparison.OrdinalIgnoreCase) &&
    b.Description.Equals(a.Description, StringComparison.OrdinalIgnoreCase)
    );

//5
IEnumerable<Article> matches5 = thisFeedArticles.Where(b =>
    b.Title.Trim().Equals(a.Title.Trim(), StringComparison.InvariantCultureIgnoreCase) &&
    b.Description.Trim().Equals(a.Description.Trim(), StringComparison.InvariantCultureIgnoreCase)
    );

//6
IEnumerable<Article> matches6 = thisFeedArticles.Where(b =>
    b.Title.Trim().Normalize().Equals(a.Title.Trim().Normalize(), StringComparison.OrdinalIgnoreCase) &&
    b.Description.Trim().Normalize().Equals(a.Description.Trim().Normalize(), StringComparison.OrdinalIgnoreCase)
    );


    if (matches.Count() == 0 && matches2.Count() == 0 && matches3.Count() == 0 && matches4.Count() == 0 && matches5.Count() == 0 && matches6.Count() == 0 && matches7.Count() == 0)
    {
    //insert values
    }

    //this if statement was the first approach
    //if (!thisFeedArticles.Any(b => b.Title == a.Title && b.Description == a.Description)
    // {
    // insert
    // }

Obviously I have only been using one of the above options at a time.

For the most part, the above options do work and most duplicates are detected, but there are still duplicates slipping through the cracks - I just need to understand what the "cracks" are, so any suggestions would be most welcome.

I did even try converting the strings to byte arrays and comparing those (deleted that code a while ago, sorry).

the Article object is as follows:

    public class Article
    {
        public string Title;
        public string Description;
    }

UPDATE:

I've tried Normalizing the strings as well as including the IgnoreSymbols CompareOption and I am still getting a false negative (non-match). What I am noticing though, is that apostrophes seem to make a consistent appearance in the false non-matches; so I'm thinking it might be a case of apostrophe vs single-quote i.e. ' vs ’ (and the like), but surely IgnoreSymbols should avoid that?

I found a couple more similar SO posts: C# string comparison ignoring spaces, carriage return or line breaks String comparison: InvariantCultureIgnoreCase vs OrdinalIgnoreCase? Next step: try using regex to strip white space as per this answer: https://stackoverflow.com/a/4719009/2261245

UPDATE 2 After the 6 comparison STILL returned no matches, I realised that there had to be another factor skewing the results, So I tried the following

//7
IEnumerable<Article> matches7 = thisFeedArticles.Where(b =>
    Regex.Replace(b.Title, "[^0-9a-zA-Z]+", "").Equals(Regex.Replace(a.Title, "[^0-9a-zA-Z]+", ""), StringComparison.InvariantCultureIgnoreCase) &&
    Regex.Replace(b.Description, "[^0-9a-zA-Z]+", "").Equals(Regex.Replace(a.Description, "[^0-9a-zA-Z]+", ""), StringComparison.InvariantCultureIgnoreCase)
    );

this DOES find the matches the others miss!

the string below got through all 6 comparisons, but not number 7:

a.Title.Trim().Normalize() and a.Title.Trim() both return:

"Corrigendum: Identification of a unique TGF-β–dependent molecular and functional signature in microglia"

Value in the DB is:

"Corrigendum: Identification of a unique TGF-ß–dependent molecular and functional signature in microglia"

Closer inspection shows that the German 'eszett' character is different in the DB compared to what's coming through from the feed: β vs ß

I would have expected at least one of comparisons 1-6 to pick that up...

Interestingly, after some performance comparisons, the Regex option is by no means the slowest of the seven. Normalize appears to quite more intensive than the Regular Expression! Here are the Stopwatch durations for all seven when the thisFeedArticles object contains 12077 items

Time elapsed: 00:00:00.0000662
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000016

Community
  • 1
  • 1
Adam Hey
  • 1,512
  • 1
  • 20
  • 24
  • 2
    If those strings come from a database, it may be relevant to know how the database column is declared, and the collation order used. – Kris Vandermotten Sep 12 '14 at 13:19
  • 1
    "but I'm now getting 'false positives' (two apparently equal strings) showing up as not-equal." isn't that a 'false negative'? – Matt Burland Sep 12 '14 at 13:20
  • 1
    also check length of comparing strings, maybe they contain some invisible characters – Charlie Sep 12 '14 at 13:20
  • 3
    You might also consider normalizing the strings, see http://msdn.microsoft.com/en-us/library/System.String.Normalize.aspx – Kris Vandermotten Sep 12 '14 at 13:22
  • 1
    Yea, I'd agree with @MattBurland. Also, can you post the actual strings you're having problems with? If you copy paste them into literals does the comparison still fail? Are there hidden characters in the data? – Jeff B Sep 12 '14 at 13:25
  • 1
    To figure out if hidden characters are your issue, you could try another comparison method that does a byte-level comparison of the strings (using string.ToCharArray(), Encoding.XXX.GetBytes(), or something of that nature). – Rudism Sep 12 '14 at 14:45
  • You could try adding CompareOptions.IgnoreSymbols. This would handle the case where there are differing whitespace characters at the end of the string. – epotter Sep 12 '14 at 15:03
  • Thanks for the suggestions guys! much appreciated. I'll try them and report back. Matt - yeah 'false negative' is more correct. Copying the into string literals (in the immediate window) did show them as equal, but the above programmatic comparisons showed not equal. Kris - it's a varchar(255) and varchar(500) respectively (title, description) – Adam Hey Sep 15 '14 at 05:48
  • Your question is very long. Try to keep it to the relevant bits, i.e. what are you trying to do and why is it not working? Reading your post I am not sure if you are having issues with database storing, rss fetching, string comparison, regex searching, unicode issues (due to `?`) or all of the above. I am thinking about closing as too broad, unclear what you are asking or lack of SSCCE. – default Sep 15 '14 at 10:40
  • 1
    is your DB field VARCHAR?? try changing it to NVARCHAR. – Pedro.The.Kid Sep 15 '14 at 14:15
  • @Default - my questions is specifically about string comparisons. I added some updates with points I found to be relevant. It's long because I wanted to ensure I covered all the bases. It's also long because I didn't know *why* it wasn't working, so I was trying to be clear about the different approaches I had tried. My first question on SO got criticised for not including enough info. Go figure... – Adam Hey Sep 16 '14 at 05:33
  • 1
    It wasn't meant as criticicing, I'm sorry if you felt like that. I do note though that there are currently no answers to your question. Your question contains multiple technologies, has 2 additional edits, 3 code sections and 2 input sections, which to me is too much info. I think that if you formulated it as "I have this string input and I want this result. This is what I tried and that did not resolve it", leaving out for instance the parts about where the input comes from you could get more answers. The other comments also relates more to database and trying to isolate your issue more. – default Sep 16 '14 at 09:56
  • No probs. I take your meaning and I agree that it probably could have been a bit more concise. @KrisVandermotten 's suggestion (regarding normalizing) has been the most successful so far. If you'd like to add that as an answer, I'll accept it – Adam Hey Sep 16 '14 at 11:31

1 Answers1

4

Unicode strings can be "binary" different, even if they are "semantically" the same.

Try normalizing your strings. For more information, see http://msdn.microsoft.com/en-us/library/System.String.Normalize.aspx

Kris Vandermotten
  • 10,111
  • 38
  • 49
  • Unfortunately - in my situation - this didn't prove to be a reliable option. See UPDATE 2 above. Thanks all the same – Adam Hey Sep 17 '14 at 09:59
  • After more testing, it appears that your first comment might be the right solution for my problem. In SSMS I ran a query which contained an "α" (alpha) and the output contained no "α", but an "a" in its place. The columns in question are varchar and changing them to nvarchar is something that might be necessary, but something I was dreading (10M+ rows...) The collation is also different between the dev db and the prod db, which doesn't help with testing/debugging. – Adam Hey Sep 17 '14 at 12:57
  • @AdamHey Well obviously, especially if you're dealing with Cyrillic characters. That was my first hunch. But since you didn't react, I assumed you had that covered already. – Kris Vandermotten Sep 17 '14 at 13:24