I have a big data frame main_df with company_names and several variables. Some of the company_names are misspelled, have typos, or need to be changed otherwise. Therefore, I am creating a vector of unique names, using:

unique_names <- unique(levels(as.factor(main_df$company_name)))

This gives me a vector that looks something like this when seen from the view window view(unique_names):

V1:
Cosmonize Bulgaria Inc.
Crown One Foundation
Institut f�r Luft-und Raumfahrttechnik

Suppose, for instance, that Crown One Foundation changed its name to Crown Two Foundation. In this case, I would hard code the change in main_df for all instances:

main_df$company_name[which(main_df$company_name == "Crown One Foundation")] <- "Crown Two Foundation"

This approach has worked well for all entries except the ones that show a replacement character, like Institut f�r Luft-und Raumfahrttechnik.

I've tried copying the entry from the view window:

main_df$company_name[which(main_df$company_name == "Institut f�r Luft-und Raumfahrttechnik")] <- "Institut fur Luft-und Raumfahrttechnik"

I've also tried to slice out the appropriate cell and used the result: unique_names[100]:

main_df$company_name[which(main_df$company_name == "Institut f\xfcr Luft-und Raumfahrttechnik")] <- "Institut fur Luft-und Raumfahrttechnik"

Neither approach worked. When I refresh unique_names <- unique(levels(as.factor(main_df$company_name))) nothing changes. Interestingly, when I search for Institute in the search window of the view window, the one in question does not appear.

Another idea I had was to work with Encoding. I used Encoding(unique_names[100]) to find that it is UTF-8. Using Encoding(unique_names[100]) <- 'latin1' changed the entry in the view window to Institut für Luft-und Raumfahrttechnik.

However, upon refreshing the unique entries using unique_names <- unique(levels(as.factor(main_df$company_name))), the entry is not updated.

Even then, main_df$company_name[which(main_df$company_name == "Institut für Luft-und Raumfahrttechnik")] <- "Institut fur Luft-und Raumfahrttechnik" doesn't lead to a change either (removing the umlaut here).

Am I looking at this the wrong way? I know there is a lot of hard coding, and I've changed all entries besides the ones with the replacement character. Therefore, I don't want to change the Encoding of the entire vector but rather change these few dozen entries manually.

Thanks a lot in advance. I don't have a package preference and would appreciate any help.

Edit: Upon request, here is the part of the output for dput(unique_names):

c("Aalborg University", "Aalto University", "Aarhus University", "ACDVE", "Aero LLC", "AgilitySpaceCorp", "Air Force Research Laboratory (AFRL)", "Airbus")

Here is dput(head(main_df$company_name)):

c("Aalborg University", "Aalborg University", "Aalborg University", "Aalborg University", "Aalborg University", "Aalborg University")
questionmark

2 Answers

EDIT:

Have you tried substituting the one character in question using gsub and a regular expression, and converting the column to character type?

Data:

df <- data.frame(
  Name = c("Institut f�r Luft-und Raumfahrttechnik", "Aarhus University", "ACDVE", "Aero LLC", "AgilitySpaceCorp", "Air Force Research Laboratory (AFRL)", "Airbus"))

Solution:

gsub("�","ü",as.character(df$Name))

Result:

[1] "Institut für Luft-und Raumfahrttechnik" "Aarhus University"                      "ACDVE"                                 
[4] "Aero LLC"                               "AgilitySpaceCorp"                       "Air Force Research Laboratory (AFRL)"  
[7] "Airbus" 

My hunch is that, if you have multiple such special cases, you should convert the whole name column to character:

df$Name <- as.character(df$Name)

This would enable you to search the dataframe for cases where you have non-ASCII characters using this regex:

df[grepl("[^ -~]", df$Name),]
[1] "Institut f�r Luft-und Raumfahrttechnik"
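Once the affected rows are found this way, one option is to overwrite them by position instead of typing the damaged string into a comparison. This is only a sketch under assumptions: the data frame below is made up, the replacement character is written as the escape \ufffd so the script does not depend on the file's encoding, and perl = TRUE is added so that the range is read as code points.

```r
# Hypothetical data: "\ufffd" is the Unicode replacement character
df <- data.frame(
  Name = c("Institut f\ufffdr Luft-und Raumfahrttechnik", "Aarhus University"),
  stringsAsFactors = FALSE
)

# Positions of names containing anything outside printable ASCII
idx <- grep("[^ -~]", df$Name, perl = TRUE)

# Inspect the hits first, then assign the corrected name by index
df$Name[idx]
df$Name[idx[1]] <- "Institut fur Luft-und Raumfahrttechnik"
```

Assigning by index avoids having to reproduce the broken string exactly, which is where the == comparisons in the question seem to fail.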
Chris Ruehlemann

The data was probably imported with a mismatched character encoding, e.g. reading ISO-8859-5 (Cyrillic) text as US-ASCII.

If you can re-import the original dataset with the correct encoding, that will probably give you a cleaner dataset to work with in the future.
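A minimal sketch of such a re-import, assuming the source file is Latin-1 (the question does not say which encoding it really uses). The temporary file and its contents are made up for illustration; the encoding argument of read.csv only marks the strings as Latin-1, and enc2utf8 then converts them. Passing fileEncoding = "latin1" instead would re-encode while reading.

```r
# Hypothetical file: a tiny CSV in which "ü" is the single Latin-1 byte 0xFC
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("company_name\nInstitut f"), as.raw(0xfc),
    charToRaw("r Luft-und Raumfahrttechnik\nAarhus University\n")),
  tmp
)

# Declare the encoding on import, then convert the column to UTF-8
main_df <- read.csv(tmp, encoding = "latin1", stringsAsFactors = FALSE)
main_df$company_name <- enc2utf8(main_df$company_name)
```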

If you need to work with what you have, I found this link to be a great starting point: How to identify/delete non-UTF-8 characters in R
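As a rough illustration of working with the data as it is, base R can flag the invalid entries and convert only those. This is a sketch, not a tested fix for the original data: the vector is made up, and it assumes the stray bytes are Latin-1, which matches the f\xfcr seen in the question but may not hold for every entry.

```r
# Hypothetical vector: the second name holds the Latin-1 byte 0xFC,
# which is not valid UTF-8
company_name <- c("Aarhus University",
                  "Institut f\xfcr Luft-und Raumfahrttechnik")

# Flag entries that are not valid UTF-8
bad <- !validUTF8(company_name)

# Re-encode only the flagged entries, treating their bytes as Latin-1
company_name[bad] <- iconv(company_name[bad], from = "latin1", to = "UTF-8")

# An exact-match replacement now finds the entry ("\u00fc" is "ü")
hit <- company_name == "Institut f\u00fcr Luft-und Raumfahrttechnik"
company_name[hit] <- "Institut fur Luft-und Raumfahrttechnik"
```

Because only the flagged entries are touched, the rest of the column keeps whatever encoding it already had.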

taiyodayo