I have a large data set with two columns, ID and property. Several rows may share the same ID, meaning one ID can have many different properties (property is a categorical variable). I want to add dummy variables for property so that the final data frame has one row per distinct ID and a 1/0 column for each property indicating whether the ID has it. The original data has 2 million rows and 10,000 distinct properties, so ideally I will reduce the row count by combining rows with the same ID and add one dummy column per property.

R crashes when I use the following code:

for (t in unique(df$property)) {
  df3[paste("property", t, sep = "")] <- ifelse(df$property == t, 1, 0)
}

So I am wondering: what is the most efficient way to add dummy variable columns for a large data set in R?

Sheldon
  • Generally there is no need to construct dummy variables in R. The factor class is more appropriate for categorical variables. I seriously doubt that R "crashes". I suspect you got some sort of error that you are not sharing. I suspect you wanted to use "[[" rather than "[". Downvote is for failing to provide `[MCVE]` – IRTFM Feb 01 '17 at 04:54
  • @42- I tested the code on a small subset of the whole data and it works, so I assume the problem is the size of the data. The size is around 100 MB – Sheldon Feb 01 '17 at 04:57
  • 100MB is not particularly large (depending on how you are measuring "data"). – IRTFM Feb 01 '17 at 04:57
  • I agree. But considering that we have thousands of dummy variables, this would be a huge sparse matrix, and I guess that might be the reason R fails. Is there any method to handle this? – Sheldon Feb 01 '17 at 05:22
  • Don't use dummies. Use factors. – IRTFM Feb 01 '17 at 15:11
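
The sparse-matrix concern raised in the comments above can be made concrete. The following is only a sketch, assuming the Matrix package is available (it normally ships with R) and using a small made-up data frame in place of the real 2-million-row data; the names `pairs`, `ids`, `props` and `dummies` are illustrative, not from the original post. It stores only the 1s, so thousands of property columns do not require a dense 0/1 data frame.

```r
library(Matrix)  # assumption: the Matrix package is installed

# Toy data standing in for the real ID/property columns (values are made up)
df <- data.frame(ID = c(1L, 1L, 3L, 4L, 5L, 5L),
                 property = c("A", "B", "C", "D", "D", "D"))

# Work on distinct (ID, property) pairs so repeated rows do not add up
pairs <- unique(df)
ids   <- factor(pairs$ID)
props <- factor(pairs$property)

# One row per ID, one column per property; only the nonzero cells are stored
dummies <- sparseMatrix(i = as.integer(ids),
                        j = as.integer(props),
                        x = 1,
                        dims = c(nlevels(ids), nlevels(props)),
                        dimnames = list(levels(ids), levels(props)))
```

Whether a sparse matrix is usable downstream depends on the modelling function; many expect a data frame or a dense matrix, in which case the factor advice above may be the better route.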

1 Answer

We can just use table

as.data.frame.matrix(table(df1))
#  A B C D
#1 1 1 0 0
#3 0 0 1 0
#4 0 0 0 1
#5 0 0 0 2
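
Note that `table` returns counts, so ID 5 shows 2 under D rather than 1. If a strict 1/0 indicator is wanted, as the question asks, one possible way (a sketch using the same toy data; `counts` and `flags` are illustrative names) is to turn every positive count into 1:

```r
# Same toy data as in the answer
df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                  b = c("A", "B", "C", "D", "D", "D"))

counts <- table(df1)                    # ID-by-property counts
flags  <- as.data.frame.matrix(counts)  # one row per ID, one column per property

# Replace each count with 1 if positive, 0 otherwise, keeping row/column names
flags[] <- lapply(flags, function(x) as.integer(x > 0))
flags
#   A B C D
# 1 1 1 0 0
# 3 0 0 1 0
# 4 0 0 0 1
# 5 0 0 0 1
```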

Or an efficient approach would be dcast from data.table

library(data.table)
# count rows per (a, b) pair; fun.aggregate is named explicitly for clarity
dcast(setDT(df1), a ~ b, value.var = "a", fun.aggregate = length)

data

df1 <- structure(list(a = c(1L, 1L, 3L, 4L, 5L, 5L), b = c("A", "B", 
"C", "D", "D", "D")), .Names = c("a", "b"), row.names = c("1", 
"2", "3", "4", "5", "6"), class = "data.frame")
akrun
  • This works for the full data set! Thanks! – Sheldon Feb 01 '17 at 15:28
  • Hi, what if I not only want a binary outcome (1 or 0), but also want to record the number of those columns for which A has a value? – Sheldon Feb 07 '17 at 22:09
  • @Sheldon In the output, there is only a single column for 'A'. It is not clear what you wanted – akrun Feb 08 '17 at 03:09
  • For example, assume I have three columns in the original data frame: ID, attributes, values. I have two IDs. ID 1 has some value for attributes A and B, while ID 2 has some value for attributes B and C. – Sheldon Feb 08 '17 at 03:15
  • Say I want to convert `data.frame(ID=c(1,1,2,2),att=c('a','b','b','c'),values=c(1,2,3,4))` to `data.frame(ID=c(1,2),a=c(1,'NA'),b=c(2,3),c=c('NA',4))` – Sheldon Feb 08 '17 at 03:16
  • @Sheldon If your first data is `df1`, `library(reshape2); dcast(df1, ID~att, value.var = "values")` – akrun Feb 08 '17 at 03:20
  • I simply replace your 'a' by ID and 'b' by attributes in my case, and it works in the sense that it converts to a data frame describing whether a ID has an attribute or not (binary). But now I am wondering whether we can record the attributes value for that ID instead of just writing 1 or 0. – Sheldon Feb 08 '17 at 03:22
  • @Sheldon I think I answered your previous comment with `dcast`. – akrun Feb 08 '17 at 03:23
  • I used `dcast(df1, ID~att, value.var = "values")` and it reports: Aggregation function missing: defaulting to length. Also, the values I got are still 1 or 0 depending on whether the ID has this attribute, not the correct continuous values from the old df. – Sheldon Feb 08 '17 at 03:35
  • @Sheldon It is based on the example you showed. If there are duplicate elements, use `library(data.table); dcast(setDT(df1), ID+rowid(ID)~att, value.var = "values")`. It is not clear without a reproducible example – akrun Feb 08 '17 at 03:39
  • Do you know how to limit the result columns to specified names? Say there are 100 potential string values for "b", but I am only interested in a few values: names=("b1", "b2"). I want to make sure that the result has the columns a, b1, b2 instead of a, b1, b2, b3, ... and all other potential values of "b" – Sheldon Apr 17 '17 at 18:33
  • @Sheldon In that case you can subset the 'b' column first, e.g. `subset(df1, b %in% c("A", "B"))`, and then do the `table` or `dcast` on it – akrun Apr 18 '17 at 04:01
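
The two follow-up requests in the comments above (keeping the actual values instead of 1/0, and restricting the result to selected columns) can be sketched together. This assumes the data.table package and uses the four-row example given in the comments; `dt`, `wide`, `keep` and `wide_subset` are illustrative names. The "Aggregation function missing: defaulting to length" message appears when some (ID, att) pair occurs more than once, so the sketch assumes each pair is unique.

```r
library(data.table)  # assumption: the data.table package is installed

# Example from the comments: exactly one value per (ID, att) pair
df <- data.frame(ID = c(1, 1, 2, 2),
                 att = c("a", "b", "b", "c"),
                 values = c(1, 2, 3, 4))
dt <- as.data.table(df)  # as.data.table() copies; setDT() converts in place

# With unique (ID, att) pairs no aggregation is needed; missing pairs become NA
wide <- dcast(dt, ID ~ att, value.var = "values")

# Keep only the attributes of interest by filtering before reshaping
keep <- c("a", "b")  # hypothetical selection
wide_subset <- dcast(dt[att %in% keep], ID ~ att, value.var = "values")
```

If the real data does contain repeated (ID, att) pairs, either an explicit `fun.aggregate` (for example `sum` or `mean`) or the `rowid(ID)` variant mentioned in the comments is needed, depending on what the duplicates mean.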