R: How to deal with replacement character � that doesn't want to disappear

Question

I have a big data frame main_df with company_names and several variables. Some of the company_names are misspelled, have typos, or need to be changed otherwise. Therefore, I am creating a vector of unique names, using:

unique_names <- unique(levels(as.factor(main_df$company_name)))

This gives me a vector that looks something like this when seen from the view window view(unique_names):

V1:
Cosmonize Bulgaria Inc.
Crown One Foundation
Institut f�r Luft-und Raumfahrttechnik

Suppose, for instance, that Crown One Foundation changed its name to Crown Two Foundation. In this case, I would hard code the change in main_df for all instances:

main_df$company_name[which(main_df$company_name == "Crown One Foundation")] <- "Crown Two Foundation"

This approach has worked well for all entries except the ones that show a replacement character, like Institut f�r Luft-und Raumfahrttechnik.

I've tried copying the entry from the view window:

main_df$company_name[which(main_df$company_name == "Institut f�r Luft-und Raumfahrttechnik")] <- "Institut fur Luft-und Raumfahrttechnik"

I've also tried to slice out the appropriate cell and used the result: unique_names[100]:

main_df$company_name[which(main_df$company_name == "Institut f\xfcr Luft-und Raumfahrttechnik")] <- "Institut fur Luft-und Raumfahrttechnik"

Neither approach worked. When I refresh unique_names <- unique(levels(as.factor(main_df$company_name))) nothing changes. Interestingly, when I search for Institute in the search window of the view window, the one in question does not appear.

Another idea I had was to work with Encoding. I used Encoding(unique_names[100]) to find that it is UTF-8. Using Encoding(unique_names[100]) <- 'latin1' changed the entry in the view window to Institut für Luft-und Raumfahrttechnik.

However, upon refreshing the unique entries using unique_names <- unique(levels(as.factor(main_df$company_name))), the entry is not updated.

Even then, main_df$company_name[which(main_df$company_name == "Institut für Luft-und Raumfahrttechnik")] <- "Institut fur Luft-und Raumfahrttechnik" doesn't lead to a change either (removing the umlaut here).

Am I looking at this the wrong way? I know there is a lot of hard coding, and I've already changed all entries besides the ones with the replacement character. Therefore, I don't want to change the Encoding property for the entire vector but rather change these few dozen entries manually.

Thanks a lot in advance. I don't have a package preference and would appreciate any help.

Edit: Upon request, here is the part of the output for dput(unique_names):

c("Aalborg University", "Aalto University", "Aarhus University", "ACDVE", "Aero LLC", "AgilitySpaceCorp", "Air Force Research Laboratory (AFRL)", "Airbus")

Here is dput(head(main_df$company_name)):

c("Aalborg University", "Aalborg University", "Aalborg University", "Aalborg University", "Aalborg University", "Aalborg University")

Thank you for your insights @27ϕ9! I've added the requested lines to the question. — questionmark, Jun 24 '20 at 04:35
Update: I managed to solve the issue by using a combination of startsWith and endsWith. Cumbersome but effective cross-platform. — questionmark, Jul 04 '20 at 18:48
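
The startsWith/endsWith workaround from the update could look roughly like the sketch below. It assumes the stored value carries the raw Latin-1 byte 0xfc while being marked as UTF-8 (which is what the Encoding() output in the question suggests). The two-row main_df is a made-up stand-in for the real data.

```r
# Made-up stand-in for main_df; "\xfc" is the Latin-1 byte for "ü"
nm <- c("Crown One Foundation", "Institut f\xfcr Luft-und Raumfahrttechnik")
Encoding(nm) <- "UTF-8"   # mimic the (wrong) UTF-8 marking reported in the question
main_df <- data.frame(company_name = nm, stringsAsFactors = FALSE)

# Match only on the ASCII text before and after the problem character
hit <- startsWith(main_df$company_name, "Institut f") &
  endsWith(main_df$company_name, "r Luft-und Raumfahrttechnik")

# Overwrite the matched entries with the cleaned-up name
main_df$company_name[hit] <- "Institut fur Luft-und Raumfahrttechnik"
```

Because only the ASCII parts on either side of the problem character are compared, the match does not depend on how the ü is stored.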

Chris Ruehlemann · Answer 1 · 2020-06-24T07:29:45.313

EDIT:

Have you tried substituting the one character in question using gsub and a regular expression, and converting the column to character type?

Data:

df <- data.frame(
  Name = c("Institut f�r Luft-und Raumfahrttechnik", "Aarhus University", "ACDVE", "Aero LLC", "AgilitySpaceCorp", "Air Force Research Laboratory (AFRL)", "Airbus"))

Solution:

gsub("�","ü",as.character(df$Name))

Result:

[1] "Institut für Luft-und Raumfahrttechnik" "Aarhus University"                      "ACDVE"                                 
[4] "Aero LLC"                               "AgilitySpaceCorp"                       "Air Force Research Laboratory (AFRL)"  
[7] "Airbus"

My hunch is that, if you have multiple such special cases, you should convert the whole Name column to character:

df$Name <- as.character(df$Name)

This would enable you to search the dataframe for cases where you have non-ASCII characters using this regex:

df[grepl("[^ -~]", df$Name),]
[1] "Institut f�r Luft-und Raumfahrttechnik"
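
If the � in the viewer comes from bytes that are not valid UTF-8, a pattern containing a pasted "�" will not match them, and a regular expression may reject the string outright. A hedged alternative sketch in base R: flag such entries with validUTF8() and re-encode only those with iconv(). The example vector is made up, and the sketch assumes the stray bytes are Latin-1, which you should check against your own data.

```r
# Made-up example: byte 0xfc is "ü" in Latin-1 but is not valid UTF-8
x <- c("Aarhus University", "Institut f\xfcr Luft-und Raumfahrttechnik")
Encoding(x) <- "UTF-8"   # mimic a wrong encoding declaration at import

# validUTF8() checks the raw bytes and ignores the declared encoding
bad <- !validUTF8(x)

# Re-encode only the flagged entries, assuming they are really Latin-1
x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")

# Every entry should now be valid UTF-8
validUTF8(x)
```

Once the entries are valid UTF-8, comparisons with == against the correctly spelled name should behave like they do for the other entries.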

Updated my answer. Much of the solution has to do with type conversion to character! — Chris Ruehlemann, Jun 24 '20 at 07:15
If it helped answer your question, please consider accepting my answer. — Chris Ruehlemann, Jun 24 '20 at 20:35

score -1 · Answer 2 · answered Jun 24 '20 at 03:27

The data was probably imported with an incompatible character set, e.g. reading ISO-8859-5 (Cyrillic) text as US-ASCII.

If you can re-import the original dataset with the correct encoding, that will probably give you a cleaner, and thus better, dataset to work with in the future.

If you need to work with what you have, I found this link to be a great starting point: How to identify/delete non-UTF-8 characters in R
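
Re-importing with an explicit encoding might look like the sketch below. The file, its contents, and the choice of Latin-1 are assumptions for illustration (the question does not say how the data was read), and the re-encoding step assumes the R session runs in a UTF-8 locale.

```r
# Write a tiny Latin-1 encoded CSV as a stand-in for the original source file
path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("company_name\nInstitut f"), as.raw(0xfc),
    charToRaw("r Luft-und Raumfahrttechnik\n")),
  path
)

# fileEncoding tells R how the file is encoded, so the text is converted
# to the session's native encoding while it is read
main_df <- read.csv(path, fileEncoding = "latin1", stringsAsFactors = FALSE)

# Check that the imported names are valid UTF-8
validUTF8(main_df$company_name)
```

If the real file uses a different encoding, replace "latin1" with that encoding name; iconvlist() shows the names available on your platform.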
