
I have a large data frame with two columns of strings. I want to perform an operation on the rows where these two columns differ.

The problem is that a simple != comparison gives incorrect results: apparently, 'Tout_Inclus' and 'Tout_Inclus' are unequal.

This led me to string comparison functions such as strcmp from the pracma package. However, it is not vectorised, and my data frame has 9.6M rows, so I expect that looping over it would crash or take ages.

Has anyone got any vectorised methods for comparing strings?

My dataframe looks like this:

    City_Break       City_Break
    City_Break       City_Break
    Court_Break      Court_Break
    Petit_Budget     Petit_Budget
    Pas_Cher         Pas_Cher
    Deals            Deals_Pas_Chers
    Vacances         Vacances_Éco
    Hôtel_Vol        Hôtel_Vol
    Dernière_Minute  Dernière_Minute
    Formule          Formule_Éco
    Court_Séjour     Court_Séjour
    Voyage           Voyage_Pas_Cher
    Séjour           Séjour_Pas_Cher
    Congés           Congés_Éco

When I do something like df[colA != colB,], it gives incorrect results: it returns rows where the strings look identical.

I've ensured encoding is UTF-8, strings are not factors, and I also tried removing special characters before doing the comparison.

By the way, these strings are from multiple languages.

Edit: I've already trimmed whitespace, and still no luck.

N8888
Tim496
  • Do you possibly have any leading/trailing whitespace in one/both of the columns? Your `df[colA != colB,]` is correct, and should have worked. – Tim Biegeleisen Aug 08 '18 at 11:29
  • Why do you assume that `strcmp` will give the desired results when `!=` doesn’t? – Konrad Rudolph Aug 08 '18 at 11:30
  • While your example `'Tout_Inclus'` would not be subject to this, to be general, there's also a possibility that accents were encoded differently on each side (resulting in the same display but different character values), or that some non-standard spaces were used on each side. I know too well that French characters are a pain to work with... – moodymudskipper Aug 08 '18 at 11:39
  • You can put those strings into two new vectors and compare them in a loop with fuzzy string matching algorithms. See https://stackoverflow.com/questions/47271685/fuzzy-matching-in-r and https://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r – İlker İlter Aug 08 '18 at 11:41

2 Answers


Try removing leading/trailing whitespace from both columns, and then compare:

    df[trimws(df$colA, "both") != trimws(df$colB, "both"), ]
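One case the default `trimws()` call does not cover is the non-standard space mentioned in the comments. The following is a minimal sketch, not part of the original answer: the two vectors are made-up values, and it assumes R 3.6.0 or later, where `trimws()` accepts a `whitespace` argument. By default only space, tab, carriage return and newline are stripped, so a no-break space (U+00A0) survives trimming.

```r
# Hypothetical values: the first left-hand string ends in a no-break space
# (U+00A0), which prints like an ordinary space.
colA <- c("Tout_Inclus\u00a0", "Deals")
colB <- c("Tout_Inclus", "Deals_Pas_Chers")

# Default trimming only strips space, tab, CR and LF.
default_trim <- trimws(colA) != trimws(colB)

# Widen the class to all Unicode horizontal and vertical white space.
unicode_trim <- trimws(colA, whitespace = "[\\h\\v]") !=
  trimws(colB, whitespace = "[\\h\\v]")

print(default_trim)
print(unicode_trim)
```

If the two results differ for a row, that row most likely contains a white space character outside the default set.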
Tim Biegeleisen

If everything else is fine (trimming, etc.), yours could be an encoding problem. In UTF-8, the same accented character can be represented by different byte sequences: either as a single precomposed code point, or as a base letter followed by a combining mark. It would be very strange for 'Tout_Inclus', though.
As a check, try this from the stringi package:

    stringi::stri_compare(df$colA, df$colB, "fr_FR")

What's the output?
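To illustrate how one accented character can have two representations, here is a minimal sketch with made-up strings (they are not taken from the asker's data). It uses base R to inspect the code points and, if stringi is installed, normalises both sides to NFC with `stringi::stri_trans_nfc()` before comparing.

```r
# Two spellings of "Séjour" that render identically but differ in code points.
composed   <- "S\u00e9jour"    # é as one precomposed code point (U+00E9)
decomposed <- "Se\u0301jour"   # e followed by a combining acute accent (U+0301)

# The plain comparison sees two different strings.
differs_raw <- composed != decomposed

# Inspecting the code points shows where the strings diverge.
print(utf8ToInt(composed))
print(utf8ToInt(decomposed))

# Normalising both sides to NFC makes equivalent strings compare equal.
if (requireNamespace("stringi", quietly = TRUE)) {
  differs_nfc <- stringi::stri_trans_nfc(composed) !=
    stringi::stri_trans_nfc(decomposed)
  print(differs_nfc)
}
```

Applied to the data frame, the same idea would read `df[stringi::stri_trans_nfc(df$colA) != stringi::stri_trans_nfc(df$colB), ]`, which stays vectorised over all rows.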

AleBdC