I have a large data frame with two columns of strings. For rows where the two columns differ, I want to perform an operation.
The problem is that when I use a simple != operator, it gives incorrect results. For example, 'Tout_Inclus' and 'Tout_Inclus' are apparently unequal.
This led me to string comparison functions such as strcmp from the pracma package. However, it is not vectorised, and my data frame has 9.6M rows, so I think looping through it would crash or take ages.
Has anyone got any vectorised methods for comparing strings?
My dataframe looks like this:
City_Break City_Break
City_Break City_Break
Court_Break Court_Break
Petit_Budget Petit_Budget
Pas_Cher Pas_Cher
Deals Deals_Pas_Chers
Vacances Vacances_Éco
Hôtel_Vol Hôtel_Vol
Dernière_Minute Dernière_Minute
Formule Formule_Éco
Court_Séjour Court_Séjour
Voyage Voyage_Pas_Cher
Séjour Séjour_Pas_Cher
Congés Congés_Éco
When I do something like df[colA != colB,] it gives incorrect results: it returns rows where the strings are equal, as far as I can tell by looking at them.
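For reference, here is a minimal sketch of the comparison on a small made-up data frame. The names df, colA and colB are placeholders, and the hidden zero-width space in the second row is only an assumption about what could make two identical-looking strings differ; it is not confirmed to be what is in the real data.

```r
# Toy data; df, colA and colB are placeholder names, not the real data.
# Row 2 of colB ends in a zero-width space (U+200B), an assumed cause.
df <- data.frame(
  colA = c("City_Break", "Tout_Inclus", "Deals"),
  colB = c("City_Break", "Tout_Inclus\u200b", "Deals_Pas_Chers"),
  stringsAsFactors = FALSE
)

# != is already vectorised: it compares the two columns element by element.
mismatch <- df$colA != df$colB

# Keep only the rows flagged as different.
df[mismatch, ]
```

In this sketch the second row is flagged as a mismatch even though, on a UTF-8 console, both strings may display the same way.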
I've ensured the encoding is UTF-8, the strings are not factors, and I also tried removing special characters before doing the comparison.
By the way, these strings are from multiple languages.
Edit: I've already trimmed whitespace, and still no luck.
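One way to see what actually differs is to compare the Unicode code points instead of the printed text. The following is a sketch under the assumption that the mismatch comes from an invisible character or from different normalisation forms (a precomposed é versus e followed by a combining accent). The vectors a and b are made up for illustration, and the stringi step is optional and only runs if that package is installed.

```r
# Made-up examples of strings that look alike but differ underneath.
a <- c("Tout_Inclus", "S\u00e9jour")         # plain text; precomposed e-acute
b <- c("Tout_Inclus\u00a0", "Se\u0301jour")  # trailing no-break space; e + combining accent

# Vectorised comparison: both pairs are reported as different.
a != b

# Code points of each element; differing lengths or values expose the hidden difference.
cp_a <- lapply(a, utf8ToInt)
cp_b <- lapply(b, utf8ToInt)
same_points <- mapply(identical, cp_a, cp_b)

# Optional: normalise both sides to NFC before comparing (requires stringi).
if (requireNamespace("stringi", quietly = TRUE)) {
  a_nfc <- stringi::stri_trans_nfc(a)
  b_nfc <- stringi::stri_trans_nfc(b)
  print(a_nfc != b_nfc)
}
```

Note that trimws() by default strips only spaces, tabs, carriage returns and newlines, so a no-break space like the one assumed here would survive trimming.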