0

I have two (example) data.frames (df1, df2)

#df1
L <- LETTERS[1:4]
b <- sample(L, 20, replace = TRUE)
df1 <- data.frame(stuff1 = 1, stuff2 = 1:10, b = b, c= NA, stringsAsFactors=FALSE)

#df2
a <- c(10,20,30,40)
df2 <- data.frame(xx = L, yy = a, stringsAsFactors=FALSE )

i want to have a new column, let's say c, in df1 based on the values out of df2. One example: A has the corresponding value of 10 (see df2) so for every A in column b of df1 should be 10 written down in the (new) line c. And this for every row of xx in df2, so in this case A,B,C and D. My code is not really working and is based only for a single value, here A:

##copy column b now it is c
df1["c"] <- df1$b


# which value has A in df2?
zz <- df2[df2$xx == "A",]
xy <- zz$yy


# fill in the new value in c (this is not working)
df1[df1$c == "A", ] <- xy 

i hope it is clear what i want to say... oh and i have some big data this is only an example to try out...

MOMO
  • 57
  • 5

1 Answers1

1

It sounds like you just want to do a merge/join. First, let's drop the empty c in df1 and change the column names a bit:

 df1 <- df1[, !names(df1) %in% "c"]
 colnames(df2) <- c("b", "c")

With just base R, we can use merge:

 df3 <- merge(df1, df2, by="b", type="left")
 head(df3)

  b stuff1 stuff2  c
1 A      1      1 10
2 A      1      2 10
3 A      1      3 10
4 A      1      3 10
5 A      1     10 10
6 A      1      7 10

The package plyr has an alternative that might be faster and will preserve column order:

library(plyr)
df4 <- join(df1, df2, by="b", type="left")
head(df4)

  stuff1 stuff2 b  c
1      1      1 A 10
2      1      2 A 10
3      1      3 A 10
4      1      4 B 20
5      1      5 B 20
6      1      6 B 20

I don't know how well that will scale with the size of your data, but if it doesn't, you could try data.table or sqldf. I haven't used these two so I can't speak much to them, but here's a comparison of their speed that might be a good starting point.

Community
  • 1
  • 1
andybega
  • 1,387
  • 12
  • 19
  • hey, thanks. it works fine with my data, i am using `join`and it is fast enough for me. Please can you say a little bit for the first row of your code? – MOMO Apr 29 '14 at 19:28
  • It's just to drop the `c` column: `%in%` does set comparison, e.g. try `c(1, 2, 3) %in% 1`. More here http://stackoverflow.com/questions/4605206/drop-columns-r-data-frame – andybega Apr 29 '14 at 21:14