Comparing Strings for match in a vectorized way

Question

I have a large data frame with two columns of strings. When the values in these columns are unequal, I want to perform an operation.

The problem is that a simple `!=` comparison gives incorrect results: apparently, 'Tout_Inclus' and 'Tout_Inclus' are unequal.

This led me to string comparison functions such as strcmp from the pracma package. However, it is not vectorised, and my data frame has 9.6M rows, so I think looping through would crash or take ages.

Has anyone got any vectorised methods for comparing strings?

My dataframe looks like this:

    City_Break       City_Break
    City_Break       City_Break
    Court_Break      Court_Break
    Petit_Budget     Petit_Budget
    Pas_Cher         Pas_Cher
    Deals            Deals_Pas_Chers
    Vacances         Vacances_Éco
    Hôtel_Vol        Hôtel_Vol
    Dernière_Minute  Dernière_Minute
    Formule          Formule_Éco
    Court_Séjour     Court_Séjour
    Voyage           Voyage_Pas_Cher
    Séjour           Séjour_Pas_Cher
    Congés           Congés_Éco

When I do something like `df[colA != colB,]`, it returns rows where the strings look equal when I inspect them.

I've ensured encoding is UTF-8, strings are not factors, and I also tried removing special characters before doing the comparison.

By the way, these strings are from multiple languages.

Edit: I've already trimmed whitespace, and still no luck.

Do you possibly have any leading/trailing whitespace in one/both of the columns? Your `df[colA != colB,]` is correct, and should have worked. — Tim Biegeleisen, Aug 08 '18 at 11:29
Why do you assume that `strcmp` will give the desired results when `!=` doesn’t? — Konrad Rudolph, Aug 08 '18 at 11:30
While your example `'Tout_Inclus'` would not be subject to this, to be general, there's also a possibility that accents were encoded differently on each side (resulting in same display but different character values), or that some non standard spaces were used on each side. I know too well that French characters are a pain to work with... — moodymudskipper, Aug 08 '18 at 11:39
You can put those strings into two new vectors and compare them in a loop with fuzzy string matching algorithms; see https://stackoverflow.com/questions/47271685/fuzzy-matching-in-r and https://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r — İlker İlter, Aug 08 '18 at 11:41
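
One way to check the possibilities raised in these comments (accents encoded differently, non-standard spaces) is to look at the code points behind two strings that print the same. The following is a minimal sketch in base R; the example values and the helper name `code_points` are made up for illustration and are not taken from the asker's data.

```r
# Hypothetical example values: both print as "Séjour" but are stored differently.
a <- "S\u00e9jour"    # precomposed é (U+00E9)
b <- "Se\u0301jour"   # plain e followed by a combining acute accent (U+0301)

a != b                # TRUE: != compares the stored characters, not the rendering

# Hypothetical helper: list the code points of each string in a character vector.
code_points <- function(x) {
  lapply(x, function(s) sprintf("U+%04X", utf8ToInt(s)))
}

# A trailing no-break space (U+00A0) is just as invisible as a combining accent.
df <- data.frame(colA = c("City_Break", a, "Deals\u00a0"),
                 colB = c("City_Break", b, "Deals"),
                 stringsAsFactors = FALSE)

suspect <- df$colA != df$colB    # vectorised over all rows
code_points(df$colA[suspect])    # inspect only the rows flagged as different
code_points(df$colB[suspect])
```

Comparing the two printed lists side by side should show exactly which character differs, which in turn suggests whether trimming, normalisation, or something else is the appropriate fix.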

score 1 · Answer 1 · answered Aug 08 '18 at 11:32

Try removing leading/trailing whitespace from both columns, and then compare:

    df[trimws(df$colA, "both") != trimws(df$colB, "both"), ]
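
One caveat worth checking, offered as an assumption rather than a diagnosis of the asker's data: by default `trimws` only strips spaces, tabs, carriage returns and newlines, so a no-break space (U+00A0) survives it. In R 3.6.0 and later, `trimws` accepts a `whitespace` argument that can widen the match. The sketch below uses made-up values.

```r
# Hypothetical values: the first string ends with a no-break space (U+00A0),
# the second with an ordinary space.
colA <- c("Deals\u00a0", "City_Break ")
colB <- c("Deals",       "City_Break")

# Default trimming removes the ordinary trailing space but not U+00A0.
trimws(colA) != colB
# TRUE FALSE

# "[\\h\\v]" matches any horizontal or vertical whitespace (requires R >= 3.6.0).
trimmedA <- trimws(colA, which = "both", whitespace = "[\\h\\v]")
trimmedB <- trimws(colB, which = "both", whitespace = "[\\h\\v]")
trimmedA != trimmedB
# FALSE FALSE
```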

Tim Biegeleisen


For a quick visual check one can also do `print(df, quote= TRUE)` – moodymudskipper Aug 08 '18 at 11:34
@Muddy_Moodskipper Thanks for the tip. – Tim Biegeleisen Aug 08 '18 at 11:35
@TimBiegeleisen - thanks for the suggestion, but still no luck – Tim496 Aug 08 '18 at 11:42
@Tim496 OK...I will delete this once you get a correct answer then. It might help others to leave it up for now. Nice first name, by the way! – Tim Biegeleisen Aug 08 '18 at 11:43

score 0 · Answer 2 · answered Aug 08 '18 at 13:09

If everything else is fine (trimming, etc.), this could be an encoding problem. In UTF-8, the same accented character can be represented by different byte sequences: either as a single precomposed code point or as a base letter followed by a combining mark. That would be very strange for 'Tout_Inclus', though, since it contains no accents.
As a check, try this from the stringi package:

    stringi::stri_compare(df$colA, df$colB, locale = "fr_FR")

What's the output?
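
If the mismatch does turn out to come from differently encoded accents, one option is to normalise both columns to the same Unicode form before comparing. The sketch below assumes the stringi package is installed and uses made-up data with the `colA`/`colB` column names from the question; `stri_trans_nfc` converts each string to NFC, after which the ordinary vectorised `!=` works on the normalised values.

```r
# Assumes the stringi package is installed; the data below is made up.
if (requireNamespace("stringi", quietly = TRUE)) {
  df <- data.frame(
    colA = c("City_Break", "S\u00e9jour",  "Deals"),            # precomposed é
    colB = c("City_Break", "Se\u0301jour", "Deals_Pas_Chers"),  # e + combining accent
    stringsAsFactors = FALSE
  )

  raw_diff <- df$colA != df$colB   # row 2 is flagged although it looks equal

  # Normalise both columns to NFC so equivalent sequences become identical.
  normA <- stringi::stri_trans_nfc(df$colA)
  normB <- stringi::stri_trans_nfc(df$colB)
  norm_diff <- normA != normB      # vectorised; only row 3 remains flagged

  print(df[norm_diff, ])

  # stri_compare returns -1, 0 or 1 per pair; 0 means the collator treats
  # the two strings as equal.
  cmp <- stringi::stri_compare(df$colA, df$colB, locale = "fr_FR")
  print(cmp)
}
```

Both calls operate on whole vectors, so no explicit loop over the rows is needed.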
