0

Suppose I have a dataframe:

    sick <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
    code <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
    sick_code <- data.frame(sick, code)  

And another:

    pid <- abs(round(rnorm(6)*1000,0))
    sick <- c("-" , "-", "-", "-", "daa16", "SO")
    p_sick <- data.frame(pid, sick)

Now i would like to add a new varialbe to p_sick, that "translates" p_sick$sick to sick_code$code. The variable in p_sick$sick is a string which may or may not be p_sick$sick in this case NA should be returned.

Now I could write for loop with a simple ifelse statement. But the data I have is 150million rows long, and the translate table is 15.000 long.

I have googled that this is the equalivalent of a "proc format" in SaS (but I do not have acces to SaS, nor do I have any idea how it works).

Perhaps some variant of merge in plyr, or an apply function?

EDIT: I have accepted both answer, since they work. I will try and look into the difference (in speed) between the two. Since merge is a built in function I am guessing it does lots of checking.

EDIT2: To people getting here by Google; merge has and sort = FALSE which will speed things up. Note that the order is not preserved in any way.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Repmat
  • 690
  • 6
  • 19
  • a vectorized solution would be `key <- setNames(code, sick); key[sick]`, no need for merge or anything. although with 150M rows, it would still probably be slow – rawr Mar 09 '15 at 19:26

3 Answers3

2

data.table will be suitable in your example:

   library(data.table)
   setkey(setDT(p_sick),sick)
   p_sick[setDT(sick_code),code := i.code][]
        pid  sick   code
    1: 3137     -     NA
    2:  755     -     NA
    3: 1327     -     NA
    4:  929     -     NA
    5:  939 daa16 cancer
    6:  906    SO     NA

Please see here for detail explanation.

Community
  • 1
  • 1
Metrics
  • 15,172
  • 7
  • 54
  • 83
  • I will try to test the difference between data.table and merge, since merge is very slow – Repmat Mar 09 '15 at 18:16
  • Yes, `data.table` is fast compared to `merge`. – Metrics Mar 09 '15 at 18:27
  • 1
    I'd suggest using `setDT()` which converts data.frames to data.tables by reference. If the object is large enough, the conversion to data.tables using `data.table()` alone will take twice the memory and be slow (due to copying), which could make using data.tables pointless... – Arun Mar 09 '15 at 18:33
1

You could use merge with all.x = TRUE (to keep values from p_sick with no match in sick_code:

merge(p_sick, sick_code, all.x = TRUE)

An equivalent is using left_join from dplyr:

library(dplyr)
left_join(p_sick, sick_code)
#    pid  sick   code
# 1  212     -   <NA>
# 2 2366     -   <NA>
# 3  325     -   <NA>
# 4  269     -   <NA>
# 5  501 daa16 cancer
# 6 1352    SO   <NA>

Note that each of these solutions works only because the name sick is shared between the two data frames. Suppose they had different names- say the column was called sickness in sick_code. You could accommodate this with, respectively:

merge(p_sick, sick_code, by.x = "sick", by.y = "sickness", all.x = TRUE)
# or
left_join(p_sick, sick_code, c(sick = "sickness"))
David Robinson
  • 77,383
  • 16
  • 167
  • 187
0

A simple named vector will also work. The named vector can act as a lookup. So instead of defining sick and code as a data frame, define it as a named vector and use it as a decode. Like this:

# Set up named vector
sick_decode <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
names(sick_decode) <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")

# Prepare data  
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
  
# Create new variable using decode
p_sick$sick_decode <- sick_decode[p_sick$sick]

# Results
#>    pid  sick sick_decode
#> 1  511     -        <NA>
#> 2 1619     -        <NA>
#> 3  394     -        <NA>
#> 4  641     -        <NA>
#> 5   53 daa16      cancer
#> 6  244    SO        <NA>

I suspect this method will also be fast, but have not benchmarked it.

Also, there is now an R package specifically for replicating SAS format functionality in R. It is called fmtr.

David J. Bosak
  • 1,386
  • 12
  • 22