Delete data based on string match - R

Question

I am programming in R to work with CSV files and data manipulation. I am trying to insert nulls wherever an entry in my CSV matches a string.

My CSV is as follows:

    first_name  last _name zip_code
    Ben         Smith      12345
    Blank       Johnson    23456
    Carrie      No         34567

The list of bad names that I would like to check my CSV against is `bad_names <- c("blank", "no", "bad", "old")`.

Once I loop through my csv looking for the bad_name string matches, I want the output to be

    first_name  last _name zip_code
    Ben         Smith      12345
                Johnson    23456
    Carrie                 34567

So it doesn't delete the entire row, just the matched entry. I am struggling with deleting just the entry, not the entire row, and with looping through the entire list of bad_names.

Thank you for any help you can offer!

You may need to use `tolower()` on the first_name column if you have a case sensitivity problem. — Gopala, Jan 04 '16 at 16:24
Actually, if you have factors there, the above offered method won't work. Not to mention this is only for one column. Maybe add a `dput` — David Arenburg, Jan 04 '16 at 16:33
It probably makes more sense to use `NA` than an empty string, and likely `stringsAsFactors = FALSE` in `read.csv()`. — alistaire, Jan 04 '16 at 20:26

score 2 · Accepted Answer · edited May 23 '17 at 10:28

Another option with regex match:

With this data (your example has an error in the `last _name` header):

    data <- read.table(text = "first_name last_name zip_code
    Ben         Smith      12345
    Blank       Johnson    23456
    Carrie      No         34567", header = TRUE)

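Note: I didn't use `stringsAsFactors = FALSE`, to show how I handle factor columns. If your columns are already character, that conversion isn't needed and you can apply `gsub` to the character columns directly, as in the second snippet further down.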

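    bad_names <- c("blank", "no", "bad", "old")
    # Case-insensitive alternation of whole words: (?i)\bblank\b|\bno\b|...
    pat <- paste0("(?i)\\b", paste0(bad_names, collapse = "\\b|\\b"), "\\b")
    # Convert every column to character (the result is a character matrix)
    m <- sapply(data, as.character)
    gsub(pat, "", m)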

I do the factor-to-character conversion with `sapply`, which is quick and dirty because it converts every column to character; there are better approaches.

The trick here is the regex construction with `paste0`: we create an alternation of the `bad_names` (separated by `|`) and surround each one with `\\b` to make sure the whole word is matched and not just part of a longer word. The leading `(?i)` makes the match case-insensitive.

Then we globally substitute (`gsub`) every match with an empty string.

Which gives:

         first_name last_name zip_code
    [1,] "Ben"      "Smith"   "12345"
    [2,] ""         "Johnson" "23456"
    [3,] "Carrie"   ""        "34567"
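
To see what the `\\b` word boundaries and `(?i)` buy you, here is a small sketch on a few made-up values (the vector `x` is hypothetical, not from the question). I pass `perl = TRUE` here so the pattern is handled by the PCRE engine; that argument is my assumption and is not part of the snippet above.

```r
bad_names <- c("blank", "no", "bad", "old")
pat <- paste0("(?i)\\b", paste0(bad_names, collapse = "\\b|\\b"), "\\b")

# Hypothetical test values: only whole-word matches should be blanked,
# regardless of case; "Nolan" and "Arnold" merely contain a bad name
x <- c("No", "Nolan", "Arnold", "BLANK", "Smith")

# perl = TRUE is an assumption added for this sketch
res <- gsub(pat, "", x, perl = TRUE)
print(res)
```

Names that only contain one of the bad names as a substring should come through unchanged, while `No` and `BLANK` are emptied.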

This works as is because everything has been converted to character. If your data frame mixes column types and you want to keep them, you'll have to do it a little differently (the pattern construction is not repeated here):

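    # Logical vector flagging the character columns
    f <- sapply(data, is.character)
    # data[f] stays a data frame even when only one column is selected
    data[f] <- lapply(data[f], gsub, pattern = pat, replacement = "")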

The idea is to find the columns that are character and apply `gsub` to their values, replacing each match with an empty string.
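
If the name columns are factors rather than character, `is.character` will not select them. A possible variant, sketched under a few assumptions: it treats factor and character columns alike, leaves numeric columns such as `zip_code` untouched, and follows the suggestion in the comments to use `NA` instead of an empty string. The helper name `clean_column`, the `perl = TRUE` argument, and the choice to turn emptied cells into `NA` are mine, not part of the original approach.

```r
# stringsAsFactors = TRUE is set explicitly so the name columns are factors
data <- read.table(text = "first_name last_name zip_code
Ben         Smith      12345
Blank       Johnson    23456
Carrie      No         34567", header = TRUE, stringsAsFactors = TRUE)

bad_names <- c("blank", "no", "bad", "old")
pat <- paste0("(?i)\\b", paste0(bad_names, collapse = "\\b|\\b"), "\\b")

# Select factor and character columns; numeric columns are left alone
is_text <- sapply(data, function(col) is.factor(col) || is.character(col))

# clean_column is a hypothetical helper, not from the original answer
clean_column <- function(col) {
  out <- gsub(pat, "", as.character(col), perl = TRUE)
  out[!nzchar(out)] <- NA  # assumption: cells left empty become NA
  out
}

data[is_text] <- lapply(data[is_text], clean_column)
print(data)
```

The cleaned columns come back as character vectors, so convert them with `factor()` afterwards if you need factors again.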

That is perfect @Tensibai! Thank you so much for your help! — Maddie, Jan 05 '16 at 19:58
