3

I have a dataframe that was imported into R and has around 180 columns and 1000s of records with string values. These strings are either single words or combination of words,eg("Good", "Very good"). They have special characters like apostrophe eg.("Don't know").

I am trying to find and replace specific string values in multiple columns with numbers so that i can process them.

I have stored the columns of interest into a variable so that I dont have to mention the names again and again. The names in the dataset are too long and have a lot of "." in them.

A reproducible code is provided as sample.

set.seed(12)
datwe <- data.frame(replicate(37,sample(c("ABC’o /BBB","XYZ","FoO","ABC'o /BBB",NA),10,rep=TRUE)))

> str(datwe)
'data.frame':   10 obs. of  37 variables:
 $ X1 : Factor w/ 3 levels "ABC'o /BBB","ABC’o /BBB",..: 2 NA NA 3 2 2 2 1 2 2
 $ X2 : Factor w/ 4 levels "ABC'o /BBB","ABC’o /BBB",..: 4 NA 4 4 4 3 3 3 1 2
 $ X3 : Factor w/ 4 levels "ABC'o /BBB","ABC’o /BBB",..: 4 1 2 1 4 4 3 2 3 1
 $ X4 : Factor w/ 2 levels "ABC'o /BBB","XYZ": 2 NA NA NA 1 NA 1 NA 2 2

Please note, "ABC'o /BBB","ABC’o /BBB" are not same even though they can be read the same.

Columns of interest are in v

v=c(names(datwe[3:6]),names(datwe[9]),names(datwe[12]))

I want to replace "ABC'o /BBB" with -100, "FoO" with 4 and "XYZ" with 5. These columns in v and other columns do have NA as well in the actual data which has to be kept as NA. my ultimate aim is to do some summarization on these columns, to find out what is the sum, mean,etc. Which columns have better mean than the other.

I have tried the below code and it has partially worked. I have been able to replace F0O with 4, but not "ABC'o /BBB" as the (') in the data and R does not match and it does not pick the relevant data.

datwe[v]<- replace(datwe[v],datwe[v]=="FoO","4")
datwe[v]<- replace(datwe[v],datwe[v]=="ABC'o /BBB","-100")

This has failed

datwe[grepl("^ABC",datwe[,v],perl=TRUE),datwe[,v]]<-"-100"

I tried with sqldf to do an update stmt, that too failed.

for(mycols in v)
sqldf("update datwe set $mycols='-100' where $mycols like 'AB%'")

Please help!

Thanks @amrrs and @Hugh for providing solutions. I have edited the sample data to reflect data closer to actual data.

I tried the below code as suggested by @amrrs

datwe[v] <- lapply(datwe[v],
                   function(x){ifelse(x=="FoO","4",
                                      ifelse(x=="XYZ","5",
                                             ifelse(x=="ABC'o /BBB","-100",x)))})

but "ABC'o /BBB" is converted to 2 instead of -100 where exact match could not be found.

I think the issue is due to the quote (') differing in R and source data.

The actual data has single quote in different forms (Don’t, don't).

I tried the solution provided by @Hugh with actual data

datwe_melt <-
  datwe[, id := .I] %>%
  melt.data.table(id.vars = "id")

This worked fine and created the columns as variables and correct values.

datwe_melt_modified<-datwe_melt[decoder, on = "value==old"] %>%
  dcast.data.table(id ~ variable, value.var = "new")

the dcast code gave the below warning.

Aggregate function missing, defaulting to 'length'

and 0 replaced NA and 1 replaced non-NA; and a new column named NA included. The decoder is coming from Excel-file. The NA in the sample data is handled correctly while in actual data it gives a new column and a new row. I know I have not understood dcast to troubleshoot it. Checked for the resolution given here and here but I am not able to resolve my issue.

I was hoping to have an efficient solution than to specifically mention all forms of quotes.

Is there are way to give something similar to like 'AB%' in SQL which should replace all that starts with "AB"

NN2017
  • 45
  • 1
  • 6

3 Answers3

1

Because of the quotes maybe your grepl didn't work. So ifelse should!

    set.seed(12)
datwe <- data.frame(replicate(37,sample(c("ABC'o /BBB","XYZ","FoO"),10,rep=TRUE)))

v=c(names(datwe[3:6]),names(datwe[9]),names(datwe[12]))

datwe[v] <- lapply(datwe[v],function(x){ifelse(x=="ABC'o /BBB",-100,ifelse(x=="FoO",1,ifelse(x=='XYZ',2,x)))})

datwe
amrrs
  • 6,215
  • 2
  • 18
  • 27
  • > datwe$v <- ifelse(datwe$v=="ABC'o /BBB","-100") Error in `$<-.data.frame`(`*tmp*`, v, value = logical(0)) : replacement has 0 rows, data has 10 – NN2017 Sep 22 '17 at 06:02
  • This works now! Thanks. have to check this code on the actual data and see if it works there! – NN2017 Sep 22 '17 at 06:44
  • Good, feel free to mark as answer with upvote if it works! – amrrs Sep 22 '17 at 06:45
  • Dont have enough reputation to mark it as Answered!! :( I'm new to posting in this forum though i have been using it for quite some time – NN2017 Sep 22 '17 at 06:49
  • 1
    I resolved the issue! Thanks for directing me there! – NN2017 Sep 26 '17 at 06:07
1

Hurray! I was able to by-pass the quotes issue with an equivalent of (like "ABC%") in SQL

datwe[v] <- lapply(datwe[v],
                   function(x){ifelse(x=="FoO","4",
                                      ifelse(x=="XYZ","5",
                                             ifelse(strtrim(x,3)=="ABC","-100",x)))})

Thanks @amrrs for taking me in the right direction!

Hope this helps others who are looking to resolve similar issue. I am sure there could be better methods. open to suggestions!!

NN2017
  • 45
  • 1
  • 6
0

Whenever you want to "decode" values in a table, you should think "how can I use a join here"?

This is a data.table solution. magrittr is just for the %>%. We first melt the table so that all the values are in one column. We then just decode that table using a lookup table (decoder). dcast.data.table is the opposite of melt.data.table -- the function returns the molten data.table to its original form

set.seed(12)
datwe <- data.frame(replicate(37,sample(c("ABC'o /BBB","XYZ","FoO"),10,rep=TRUE)))
library(data.table)
library(magrittr)

setDT(datwe)
datwe_melt <-
  datwe[, id := .I] %>%
  melt.data.table(id.vars = "id")

decoder <- 
  data.table(old = c("ABC'o /BBB","XYZ","FoO"),
             new = c(-100, 2, 1))

datwe_melt[decoder, on = "value==old"] %>%
dcast.data.table(id ~ variable, value.var = "new")

Note that you may be better off just using the data.table after datwe_melt[decoder, on = "value==old"] as it's often easier to run queries on tables with this structure.

Hugh
  • 15,521
  • 12
  • 57
  • 100
  • This is excellent! have to see how it turns out on the actual data. – NN2017 Sep 22 '17 at 06:45
  • on Actual data, am getting warning: Aggregate function missing, defaulting to 'length'. Not sure on how to resolve it. I have edited the problem above with the results based on your suggestion. – NN2017 Sep 26 '17 at 05:22
  • I don't get that error. Do you have more columns in your data? – Hugh Sep 26 '17 at 05:26
  • Yes, I have around 180 columns and 2 of them are primary keys. even if I take only the subset where I need to find and replace, I am not able to due to the mentioned warning which destroys the whole purpose – NN2017 Sep 26 '17 at 06:05