I have a dataframe that was imported into R and has around 180 columns and 1000s of records with string values. These strings are either single words or combination of words,eg("Good", "Very good"). They have special characters like apostrophe eg.("Don't know").
I am trying to find and replace specific string values in multiple columns with numbers so that i can process them.
I have stored the columns of interest into a variable so that I dont have to mention the names again and again. The names in the dataset are too long and have a lot of "." in them.
A reproducible code is provided as sample.
set.seed(12)
datwe <- data.frame(replicate(37,sample(c("ABC’o /BBB","XYZ","FoO","ABC'o /BBB",NA),10,rep=TRUE)))
> str(datwe)
'data.frame': 10 obs. of 37 variables:
$ X1 : Factor w/ 3 levels "ABC'o /BBB","ABC’o /BBB",..: 2 NA NA 3 2 2 2 1 2 2
$ X2 : Factor w/ 4 levels "ABC'o /BBB","ABC’o /BBB",..: 4 NA 4 4 4 3 3 3 1 2
$ X3 : Factor w/ 4 levels "ABC'o /BBB","ABC’o /BBB",..: 4 1 2 1 4 4 3 2 3 1
$ X4 : Factor w/ 2 levels "ABC'o /BBB","XYZ": 2 NA NA NA 1 NA 1 NA 2 2
Please note, "ABC'o /BBB","ABC’o /BBB" are not same even though they can be read the same.
Columns of interest are in v
v=c(names(datwe[3:6]),names(datwe[9]),names(datwe[12]))
I want to replace "ABC'o /BBB" with -100, "FoO" with 4 and "XYZ" with 5. These columns in v and other columns do have NA as well in the actual data which has to be kept as NA. my ultimate aim is to do some summarization on these columns, to find out what is the sum, mean,etc. Which columns have better mean than the other.
I have tried the below code and it has partially worked. I have been able to replace F0O with 4, but not "ABC'o /BBB" as the (') in the data and R does not match and it does not pick the relevant data.
datwe[v]<- replace(datwe[v],datwe[v]=="FoO","4")
datwe[v]<- replace(datwe[v],datwe[v]=="ABC'o /BBB","-100")
This has failed
datwe[grepl("^ABC",datwe[,v],perl=TRUE),datwe[,v]]<-"-100"
I tried with sqldf to do an update stmt, that too failed.
for(mycols in v)
sqldf("update datwe set $mycols='-100' where $mycols like 'AB%'")
Please help!
Thanks @amrrs and @Hugh for providing solutions. I have edited the sample data to reflect data closer to actual data.
I tried the below code as suggested by @amrrs
datwe[v] <- lapply(datwe[v],
function(x){ifelse(x=="FoO","4",
ifelse(x=="XYZ","5",
ifelse(x=="ABC'o /BBB","-100",x)))})
but "ABC'o /BBB" is converted to 2 instead of -100 where exact match could not be found.
I think the issue is due to the quote (') differing in R and source data.
The actual data has single quote in different forms (Don’t, don't).
I tried the solution provided by @Hugh with actual data
datwe_melt <-
datwe[, id := .I] %>%
melt.data.table(id.vars = "id")
This worked fine and created the columns as variables and correct values.
datwe_melt_modified<-datwe_melt[decoder, on = "value==old"] %>%
dcast.data.table(id ~ variable, value.var = "new")
the dcast code gave the below warning.
Aggregate function missing, defaulting to 'length'
and 0 replaced NA and 1 replaced non-NA; and a new column named NA included. The decoder is coming from Excel-file. The NA in the sample data is handled correctly while in actual data it gives a new column and a new row. I know I have not understood dcast to troubleshoot it. Checked for the resolution given here and here but I am not able to resolve my issue.
I was hoping to have an efficient solution than to specifically mention all forms of quotes.
Is there are way to give something similar to like 'AB%' in SQL which should replace all that starts with "AB"