How to add dummy variables in R for a large data set

Question

I have a large data set with two columns, ID and Property. Several rows may share the same ID, meaning that one ID can have many different properties (Property is a categorical variable). I want to add dummy variables for Property and end up with a data frame that has one row per distinct ID and a 1/0 column for each property indicating whether the ID has it. The original data has 2 million rows and 10,000 distinct properties. So, ideally, I would reduce the row count by combining rows with the same ID and add one dummy variable column per property.

R crashes when I use the following code:

for (t in unique(df$property)) {
  df3[paste("property", t, sep = "")] <- ifelse(df$property == t, 1, 0)
}

So I am wondering: what is the most efficient way to add dummy variable columns for a large data set in R?

Generally there is no need to construct dummy variables in R. The factor class is more appropriate for categorical variables. I seriously doubt that R "crashes". I suspect you got some sort of error that you are not sharing. I suspect you wanted to use "[[" rather than "[". Downvote is for failing to provide `[MCVE]` — IRTFM, Feb 01 '17 at 04:54
@42- I tested the code on a small subset of the whole data and it works, so I assume the problem is the size of the data. The size is around 100 MB — Sheldon, Feb 01 '17 at 04:57
100MB is not particularly large (depending on how you are measuring "data"). — IRTFM, Feb 01 '17 at 04:57
I agree. But considering that we have thousands of dummy variables, this would be a huge sparse matrix, which I guess might be the reason R fails. Is there any method to handle this? — Sheldon, Feb 01 '17 at 05:22

akrun · Accepted Answer · 2017-02-01T05:49:45.970

We can just use `table`, which counts how many times each ID/property pair occurs:

as.data.frame.matrix(table(df1))
#  A B C D
#1 1 1 0 0
#3 0 0 1 0
#4 0 0 0 1
#5 0 0 0 2
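
The `table` call above reports counts, so ID 5 gets a 2 under D, whereas the question asks for 1/0 indicators. A minimal sketch of one way to collapse the counts, using the answer's toy `df1` (the name `dummies` is made up here for illustration):

```r
# Same toy data as in the answer: 'a' is the ID, 'b' is the property
df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                  b = c("A", "B", "C", "D", "D", "D"))

# Counts per ID/property pair, exactly as in the answer
dummies <- as.data.frame.matrix(table(df1))

# Collapse any count above zero to 1 so every column is a 0/1 indicator
dummies[] <- lapply(dummies, function(x) as.integer(x > 0))
dummies
```

Note that this still builds a dense ID-by-property table, so memory use grows with the number of IDs times the number of properties.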

Or a more efficient approach would be `dcast` from data.table, with `length` as the aggregation function:

library(data.table)
dcast(setDT(df1), a ~ b, value.var = "a", fun.aggregate = length)
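
Like `table`, the `dcast` call with `length` returns counts rather than strict 1/0 values when an ID repeats a property. A hedged sketch of one possible variation: drop duplicate ID/property rows first, so that `length` can only return 0 or 1. The names `dt` and `wide` are illustrative, and `as.data.table` is used instead of `setDT` so the original data frame is not modified by reference.

```r
library(data.table)

# Same toy data as in the answer
df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                  b = c("A", "B", "C", "D", "D", "D"))

# Remove repeated ID/property rows so each pair appears at most once
dt <- unique(as.data.table(df1))

# With unique pairs, length() yields 1 where the pair exists and 0 otherwise
wide <- dcast(dt, a ~ b, fun.aggregate = length, value.var = "b")
wide
```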

data

df1 <- structure(list(a = c(1L, 1L, 3L, 4L, 5L, 5L), b = c("A", "B", 
"C", "D", "D", "D")), .Names = c("a", "b"), row.names = c("1", 
"2", "3", "4", "5", "6"), class = "data.frame")
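
The comments on the question raise the concern that thousands of dummy columns amount to a huge, mostly empty matrix. This is not part of the accepted answer, but one option, assuming the downstream code can work with a matrix instead of a data frame, is to store only the non-zero cells with `sparseMatrix` from the Matrix package. A minimal sketch on the same toy data (the names `pairs`, `ids`, `props` and `m` are illustrative):

```r
library(Matrix)

# Same toy data as in the answer
df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                  b = c("A", "B", "C", "D", "D", "D"))

# Keep each ID/property pair once so the cells are 1 rather than counts
pairs <- unique(df1)
ids   <- factor(pairs$a)
props <- factor(pairs$b)

# Rows are IDs, columns are properties; only the cells equal to 1 are stored
m <- sparseMatrix(i = as.integer(ids), j = as.integer(props), x = 1,
                  dims = c(nlevels(ids), nlevels(props)),
                  dimnames = list(levels(ids), levels(props)))
m
```

Whether this helps depends on what consumes the result: converting `m` back to a dense data frame would give up the memory savings.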

edited Feb 01 '17 at 05:49

answered Feb 01 '17 at 05:37

akrun

This works for the full data set! Thanks! – Sheldon Feb 01 '17 at 15:28
Hi, what if I not only want a binary outcome (1 or 0), but also want to record the number of those columns that A has? – Sheldon Feb 07 '17 at 22:09
@Sheldon In the output, there is only a single column for 'A'. It is not clear what you wanted – akrun Feb 08 '17 at 03:09
For example, assume I have three columns in the original data frame: ID, attributes, values. I have two IDs. ID 1 has some value for attributes A and B, while ID 2 has some value for attributes B and C. – Sheldon Feb 08 '17 at 03:15
Say I want to convert `data.frame(ID=c(1,1,2,2),att=c('a','b','b','c'),values=c(1,2,3,4))` to `data.frame(ID=c(1,2),a=c(1,'NA'),b=c(2,3),c=c('NA',4))` – Sheldon Feb 08 '17 at 03:16
@Sheldon If your first data is `df1`, `library(reshape2); dcast(df1, ID~att, value.var = "values")` – akrun Feb 08 '17 at 03:20
I simply replaced your 'a' with ID and 'b' with attributes in my case, and it works in the sense that it produces a data frame describing whether an ID has an attribute or not (binary). But now I am wondering whether we can record the attribute's value for that ID instead of just writing 1 or 0. – Sheldon Feb 08 '17 at 03:22
@Sheldon I think I answered your previous comment with `dcast`. – akrun Feb 08 '17 at 03:23
I used `dcast(df1, ID~att, value.var = "values")` and it reports: Aggregation function missing: defaulting to length. Also, the values I got are still 1 or 0 depending on whether the ID has this attribute, not the correct continuous values from the old df. – Sheldon Feb 08 '17 at 03:35
@Sheldon It is based on the example you showed. If it has duplicate elements, try `library(data.table); dcast(setDT(df1), ID+rowid(ID)~att, value.var = "values")`. It is not clear without a reproducible example. – akrun Feb 08 '17 at 03:39
Do you know how to limit the columns of the result to specified names? Say there are 100 potential string values for "b", but I am only interested in a few values: names=("b1", "b2"). I want to make sure that the result has the column names a, b1, b2 instead of a, b1, b2, b3, ... and all other potential values for "b". – Sheldon Apr 17 '17 at 18:33
@Sheldon In that case you can subset the 'b' column first, e.g. `subset(df1, b %in% c("A", "B"))`, and then do the `table` or `dcast` on it. – akrun Apr 18 '17 at 04:01
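
The two follow-up requests in the comments (keeping the attribute values instead of 1/0, and restricting the output to a few attributes) can be sketched together. This assumes each ID/attribute pair occurs at most once, as in the four-row example from the comments. With duplicate pairs, `dcast` would fall back to `length`, which matches the warning reported above. The names `long`, `wide`, `keep` and `wide_subset` are illustrative.

```r
library(data.table)

# Four-row example taken from the comments
long <- data.table(ID = c(1, 1, 2, 2),
                   att = c("a", "b", "b", "c"),
                   values = c(1, 2, 3, 4))

# One row per ID, one column per attribute; missing pairs become NA
wide <- dcast(long, ID ~ att, value.var = "values")

# Restrict to the attributes of interest before reshaping
keep <- c("a", "b")
wide_subset <- dcast(long[att %in% keep], ID ~ att, value.var = "values")
wide_subset
```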
