
A dummy column for a column c and a given value x equals 1 if c==x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of one's choosing, since the last dummy column adds no information beyond the dummy columns that already exist.

Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:

values <- unique(myDataTable$firm)
# the [-1]: I arbitrarily do not create a dummy for the first unique value
cols <- paste('d', as.character(values[-1]), sep='_') # gives us nice d_value names for columns
# one logical column per remaining value: TRUE where firm equals that value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
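
For anyone who wants to reproduce the pattern without the original data, here is a minimal sketch on toy data. The table `dt` and its firm IDs are invented for illustration, and `as.integer()` is an optional addition that stores 0/1 instead of TRUE/FALSE.

```r
library(data.table)

# Toy data; the firm IDs are made up for illustration
dt <- data.table(firm = c(101, 102, 103, 101, 102))

values <- unique(dt$firm)
# Skip the first unique value so it acts as the reference category
cols <- paste("d", values[-1], sep = "_")
# One 0/1 column per remaining value
dt[, (cols) := lapply(values[-1], function(x) as.integer(firm == x))]

print(dt)
```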

This code reliably worked for previous columns, which had fewer unique values. firm, however, has many more:

str(values)
 num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...

I get a warning when trying to add the columns:

Warning message:
  truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().

As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this warning or of the relevance of truelength.

FooBar
  • Provide the data, just a sample, use `dput(myDataTable[1:10])`. **Edit:** looks like related to the size of columns to be added, so sample data might be not easy to share. Did you try to set mentioned option to `length(values)`? – jangorecki Apr 13 '15 at 21:13
  • 6000+ columns?!? :-O. Read `?truelength` and use `alloc.col` with `n` argument to grow spare slots to how many ever columns you're creating.. else you'll receive the warning because we've to over-allocate every time spare slots are used up.. – Arun Apr 13 '15 at 21:19
  • @Arun `ncol(myDataTable)` gives me `[1] 3085`, so that message doesn't really make sense. Do I understand correctly that I'm being inefficient every time I'm adding a huge chunk of columns which I didn't preallocate for? In that case, since this is a unique operation, I guess I'm fine. – FooBar Apr 13 '15 at 21:22
  • truelength = no: of cols + free slots. When you normally create a `data.table`, say, with 2 cols (true length is 100 by default), there are 98 free slots.. If you now add 99 cols, we've to shallow copy on the 99th time (no free slots) and over-allocate to >101 cols, assign the result back.. It's negligible, but if done too many times, can get noticeable. – Arun Apr 13 '15 at 21:26
  • In your case, over-allocate to a large number (ex: 3200, since you say 3000+ cols) using `alloc.col` for this data.table. Everything should be fine then. – Arun Apr 13 '15 at 21:29
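
The relationship described in the comments (truelength = columns in use + free slots) can be inspected directly. This is only a sketch on a hypothetical two-column table `dt`; the default number of slots depends on the installed data.table version and on the `datatable.alloccol` option, so no specific value is assumed here.

```r
library(data.table)

# Hypothetical two-column table, just to inspect the allocation
dt <- data.table(a = 1:3, b = 4:6)

used  <- ncol(dt)        # columns actually in use
total <- truelength(dt)  # columns in use plus spare slots
spare <- total - used    # free slots still available for new columns

cat("used:", used, " total:", total, " spare:", spare, "\n")
```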

1 Answer


Taking Arun's comment as an answer.
Use the alloc.col function to pre-allocate column slots in your data.table, choosing a number larger than the expected ncol.

alloc.col(myDataTable, 3200)
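
As a rough end-to-end sketch, pre-allocation followed by the bulk assignment could look like the code below. The toy table `dt` with 50 invented firm IDs is an assumption for illustration. As far as I know, the exact meaning of `n` (total slots versus spare slots) has differed between data.table versions, and newer releases also offer `setalloccol` under the same signature, so check `?truelength` for your installation.

```r
library(data.table)

# Toy table with 50 made-up firm IDs, two rows each
dt <- data.table(firm = rep(1:50, each = 2))
values <- unique(dt$firm)
cols <- paste("d", values[-1], sep = "_")

# Reserve column slots up front, before the bulk assignment
alloc.col(dt, 3200)

# Add all dummy columns in one go
dt[, (cols) := lapply(values[-1], function(x) firm == x)]

ncol(dt)        # 1 original column plus 49 dummies
truelength(dt)  # at least 3200 under either reading of n
```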

Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long table; see EAV (entity-attribute-value). Then you need only one column per data type.
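
One possible way to do such a reshape is data.table's `melt`. The sketch below assumes a hypothetical `id` column identifying each row and two invented dummy columns; whether this layout suits your analysis depends on how the data is used afterwards.

```r
library(data.table)

# Hypothetical wide table: one row per id, one column per dummy
wide <- data.table(id = 1:3,
                   d_102 = c(0L, 1L, 0L),
                   d_103 = c(0L, 0L, 1L))

# Long layout: one row per (id, dummy) pair, with a single value column
long <- melt(wide, id.vars = "id",
             variable.name = "dummy", value.name = "value")

print(long)
```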

jangorecki