The problem:

Let us consider a dataframe df:

df <- structure(list(id = 1:4, var1 = c("blissard", "Blizzard", "storm of snow", 
"DUST DEVIL/BLIZZARD")), .Names = c("id", "var1"), class = "data.frame", row.names = c(NA, 
-4L))

> df

id  var1   
1   "blissard"
2   "Blizzard"
3   "storm of snow"
4   "DUST DEVIL/BLIZZARD"

> class(df$var1)
[1] "character"

I would like to tidy it up, so I recode var1, which has four different entries, into a cleaner and easier-to-analyse var1_recoded:

df$var1_recoded[grepl("[Bb][Ll][Ii]", df$var1)] <- "blizzard"
df$var1_recoded[grepl("[Ss][Tt][Oo]", df$var1)] <- "storm"

id  var1                  var1_recoded   
1   "blissard"            "blizzard"  
2   "Blizzard"            "blizzard"
3   "storm of snow"       "storm"
4   "DUST DEVIL/BLIZZARD" "blizzard"

The question:

How can I create a function that automates the process performed by the two lines of code above? In other words: how can this be generalized to, let's say, 1000 replacements?

I would pass the function a vector (such as c("storm", "blizzard")) and have it match each observation against those values and replace the ones that satisfy the condition.

I found a useful contribution here: Replace multiple arguments with gsub, but I am not able to translate the process described above into R programmatically. In particular, I cannot build the condition that lets grep match on the first three letters of each word.

oguz ismail
Worice
  • 3
    there is no question in your "the question" – rawr Dec 22 '15 at 23:21
  • @rawr I apologize for not being more straightforward: how can I create a function that automates the process performed by the two lines of code above? – Worice Dec 22 '15 at 23:49
  • 1
    Can't you just put the above two lines of code into a function? – ytk Dec 23 '15 at 00:34
  • @Teka K how would that be generalizable to 1000 replacements? To some extent I feel like the comments on this question are giving a brand new SO user a hard time when they've asked a question, provided data + code and a desired output. Not bad for someone with a rep of 11. – Tyler Rinker Dec 23 '15 at 00:41
  • @TylerRinker my apologies, I didn't intend to sound rude. Seeing the answers now, I realize that I didn't understand the question correctly. – ytk Dec 23 '15 at 00:50
  • @TejaK Gotcha seems a misunderstanding on my part as well. I'm protective of new SO (and possibly new R) users because so many people were patient with me when I began and SO was a safe place to ask questions. – Tyler Rinker Dec 23 '15 at 00:52
  • I would like to thank you all for helping me present the problem more elegantly. In particular, thank you @rawr for pointing out the problem with the question, and thank you @TylerRinker for making the question itself straightforward and for your patience. I have edited the post with those suggestions. – Worice Dec 23 '15 at 09:26

2 Answers

Here's one possible approach:

The data

dat <- read.csv(text="id,  var1  
1,   blissard
2,   Blizzard
3,   storm of snow
4,   hurricane
5,   DUST DEVIL/BLIZZARD", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

x <- c("storm", "blizzard")

Solution

if (!require("pacman")) install.packages("pacman")
pacman::p_load(stringdist, stringi)

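dat[["var1_recoded"]] <- NA
tol <- .6  # largest Jaccard distance still accepted as a match

for (i in seq_len(nrow(dat))) {
    # split the entry into words and compare every word with every target
    potentials <- unlist(stri_extract_all_words(dat[["var1"]][i]))
    y <- stringdistmatrix(tolower(potentials), tolower(x), method = "jaccard")
    if (min(y) > tol) {
        # no word is close enough: keep the original value
        dat[["var1_recoded"]][i] <- dat[["var1"]][i]
    } else {
        # column of the first closest pair; [1, 2] stays valid when several pairs tie
        dat[["var1_recoded"]][i] <- x[which(y == min(y), arr.ind = TRUE)[1, 2]]
    }
}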

##   id                var1 var1_recoded
## 1  1            blissard     blizzard
## 2  2            Blizzard     blizzard
## 3  3       storm of snow        storm
## 4  4           hurricane    hurricane
## 5  5 DUST DEVIL/BLIZZARD     blizzard

Edit: incorporated @mra68's data into the solution.
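To see why a tolerance of .6 separates the typos from the genuinely different word, the distance can be reproduced by hand. The sketch below is a base-R illustration, not part of the answer: `jaccard_chars` is a hypothetical helper, and it assumes that `method = "jaccard"` is used with stringdist's default `q = 1`, i.e. that the distance is computed on the sets of single characters of the two strings.

```r
# Hypothetical helper: Jaccard distance between the sets of single
# characters of two strings (assumed to mirror stringdist with q = 1).
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])
  sb <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

jaccard_chars("blissard",  "blizzard")  # 0.25 -> below tol, recoded
jaccard_chars("vlizzard",  "blizzard")  # 0.25 -> a wrong first letter still matches
jaccard_chars("hurricane", "blizzard")  # 0.75 -> above tol, left as is
```

Because the comparison is made on character sets, the position of a typo does not matter, which is what distinguishes this approach from matching on the first three letters.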

Tyler Rinker
  • @TylerRinker thank you for the answer! It works but, in my newbie opinion, the solution proposed by mra68 is easier. More "pythonic", perhaps. Which difference between the two do you consider most relevant? Is one approach better suited to a particular operation or context than the other? – Worice Dec 23 '15 at 11:07
  • 1
    It doesn't assume correctly spelled first three letters. So `Vlizzard conditions` would be turned to `blizzard`. – Tyler Rinker Dec 23 '15 at 13:03
  • If I understand correctly, this procedure is more powerful: it can even deal with a dataframe affected by many data-entry typos. Thank you, I will now stress-test it a little. – Worice Dec 23 '15 at 14:31
1
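f <- function( x )
{
  A <- c( "blizzard", "storm" )      # target values
  A3 <- sapply(A,substr,1,3)         # their first three letters
  x <- as.character(x)               # var1 may be a factor
  # index of the last target whose first three letters occur in x (0 if none)
  n <- max( c( 0, which( sapply( A3, grepl, tolower(x) ) ) ) )

  if ( n==0 )
  {
    warning( "nothing found")
    return (x)
  }

  A[n]
}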

df <- data.frame( id = 1:5,
                  var1 = c( "blissard", "Blizzard", "storm of snow", "DUST DEVIL/BLIZZARD", "hurricane" ) )

If neither "blizzard" nor "storm" matches, "var1" is left as is (with a warning). "hurricane" is an example.

> df$var1_recoded <- sapply(df$var1,f)
Warning message:
In FUN(X[[i]], ...) : nothing found
> df
  id                var1 var1_recoded
1  1            blissard     blizzard
2  2            Blizzard     blizzard
3  3       storm of snow        storm
4  4 DUST DEVIL/BLIZZARD     blizzard
5  5           hurricane    hurricane
> 
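Since the question asks for a function that is fed the vector of replacements, the same first-three-letters idea could be sketched with the targets passed as an argument instead of being hard-coded. This is only an illustration under stated assumptions: `recode_by_prefix`, `targets` and `n` are hypothetical names that do not appear in the answer, and unlike `f` above this version keeps the first matching target rather than the last and does not emit a warning.

```r
# Hypothetical helper (not from the answer): recode each element of x to the
# first target whose leading n letters occur anywhere in it, ignoring case.
recode_by_prefix <- function(x, targets, n = 3) {
  x <- as.character(x)                      # also accepts factors
  prefixes <- tolower(substr(targets, 1, n))
  out <- x                                  # unmatched values are kept as they are
  matched <- rep(FALSE, length(x))
  for (i in seq_along(targets)) {
    # elements not recoded yet that contain the i-th prefix
    hit <- !matched & grepl(prefixes[i], tolower(x), fixed = TRUE)
    out[hit] <- targets[i]
    matched <- matched | hit
  }
  out
}

var1 <- c("blissard", "Blizzard", "storm of snow", "DUST DEVIL/BLIZZARD", "hurricane")
recode_by_prefix(var1, c("blizzard", "storm"))
# "blizzard" "blizzard" "storm" "blizzard" "hurricane"
```

The loop runs over the targets rather than over the rows, so 1000 replacements would mean 1000 vectorised `grepl` calls on the whole column.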
mra68
  • This solution works perfectly. It even deals with a conflicting observation containing both terms, such as `blizzard/storm`: in that case `df$var1_recoded` gets the last matching element of `A <- c( "blizzard", "storm" )`. – Worice Dec 23 '15 at 10:36