6

I have a data.table in R that was fetched from a database that looks like this:

date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-04-01,2,bar,1,100,200
2014-05-01,2,bar,1,100,200
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-05-01,4,quux,2,100,200
<SNIP>

In order to do some calculations on the data, I'd like to massage it so that each combination of date, identifier, description and location has a row in the table with NA as value1 and value2. I know the range of date and all potential values for location.

I'm new to both R and data.table and my mind is mush at this point. The result I'd like to come up with for the above sample table is:

date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-03-01,2,bar,2,NA,NA
2014-04-01,2,bar,1,100,200
2014-04-01,2,bar,2,NA,NA
2014-05-01,2,bar,1,100,200
2014-05-01,2,bar,2,NA,NA
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-03-01,4,quux,1,NA,NA
2014-03-01,4,quux,2,NA,NA
2014-04-01,4,quux,1,NA,NA
2014-04-01,4,quux,2,NA,NA
2014-05-01,4,quux,1,NA,NA
2014-05-01,4,quux,2,100,200

The data in the database is sparse in that a given identifier/description/location combination could have any number of entries or none at all for each date. I want to get to for a given date range (e.g., 2014-03-01 through 2014-05-01) each identifier/description and location has a row in the table.

This seems like something there'd be an interesting data.table trick to do, but I'm blanking.

Edit: I did this on a smaller scale for one identifier/description by merging in another datatable but I'm not sure how to do this with the added complexity of multiple identifier/descriptions and locations.

Thanks very much for your responses.

Here is dput output of the original data that can be readily copied into R:

structure(list(date = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), 
.Label = c("2014-03-01", "2014-04-01", "2014-05-01"), class = "factor"), 
identifier = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L),     
description = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 4L), 
.Label = c("bar", "baz", "foo", "quux"), class = "factor"), 
location = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L), 
value1 = c(100L, 200L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 200L, 100L, 100L, 100L, 100L, 100L), 
value2 = c(200L, 300L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 300L, 200L, 200L, 200L, 200L, 200L)), 
.Names = c("date", "identifier", "description", "location", "value1", "value2"), 
row.names = c(NA, -16L),
class = c("data.table", "data.frame"))
  • 1
    `rbind(DT0,DT0[,.(value1=NA_integer_,value2=NA_integer_),by=.(date,identifier,description,location)])` fits your verbal description, but your given output only alters `bar` rows. – Frank May 13 '15 at 16:37
  • Have you tried something like... `read.csv('filename',sep = ',')` and than using the date format for first column?? Or maybe I missundertood question... – Kamil S Jaron May 13 '15 at 16:37
  • If that's the right answer, this question is almost a dupe of http://stackoverflow.com/q/30109017/1191259 – Frank May 13 '15 at 16:40
  • @Slim You can read in the data with `read.csv(header=TRUE,text="copypaste_OP_text_here")` – Frank May 13 '15 at 16:40
  • 1
    @Slim you are completely of course here. @Frank, no need in `header = TRUE` – David Arenburg May 13 '15 at 16:52
  • @Frank, I will play with what you provided, but I don't understand it (yet). I have updated the question data with another example in case it helps clarify. – Jerry Smithwell May 13 '15 at 16:54
  • @JerrySmithwell It is easiest to help if your example data is minimal (as simple as possible to illustrate the problem) and self-contained (with the desired output corresponding to the example input). Here's a reference: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Frank May 13 '15 at 16:58
  • @Frank Thanks for your pointer. Basically I have period spanning several months (in this case 3 months, 2014-03-01 through 2014-05-01), 4 identifiers and 2 locations that could be reporting value1 or value2 in 0 or more periods. In the case of 3 months, and two possible locations, i want there to always be 6 rows for each identifier (one for location in each of the 3 periods) with the 'missing' values as NA. (Thank you for your patience) – Jerry Smithwell May 13 '15 at 17:03
  • @Frank This is definitely not the polite way how to load data. Usually you would like to be able update source file and repeat analysis without changing the script. This actually the reason, why to use R before different statistical software. – Kamil S Jaron May 13 '15 at 17:10
  • @Slim I don't think I understand. I agree that it's best if the OP post data in R format, not delimited text, especially if date and string columns are involved (since `read.csv` will probably give those columns different formats than the OP truly has), if that's what you mean..? I'm not advocating writing a program with `"copypaste_of_OP_text"` in it, if that's what you meant. – Frank May 13 '15 at 17:13
  • 1
    I've updated the question with dput output to easily replicate the data in R. Sorry for my misunderstanding as to what was being asked. – Jerry Smithwell May 13 '15 at 17:19
  • Based on the description, you would be getting 32 rows, but the expected output is only 24 rows. In addition to @Franks' method `DT0[, .SD[c(1:.N, .N+1)], .(date, identifier, description, location)]` – akrun May 13 '15 at 17:31
  • @akrun It should be 24 rows - each identifier should have 2 locations and 3 dates and there are 4 identifiers. I would like the identifiers with missing locations or dates to populate those missing locations and dates with NA for value1 and value2 ('identifier' will always have the same 'description') – Jerry Smithwell May 13 '15 at 17:33
  • @akrun My first guess was wrong. Something with this set is involved: `do.call(CJ,with(DT0,lapply(list(description,date,location),unique)))` (I'm not planning to make an answer any time soon, btw, so go ahead if you figure it out.) – Frank May 13 '15 at 17:36
  • @Frank that's exactly what i'm looking for, but maintaining the original values and populating NA in the value1/value2s that have none in the original. – Jerry Smithwell May 13 '15 at 17:40
  • 1
    @Frank Thanks, looks like some modification is needed. I think you may need to join after setting the `key` columns – akrun May 13 '15 at 17:40
  • 1
    @akrun Fyi, `CJ` sets keys, though the documentation says so very strangely. (Instead of saying its key is set, the doc says it gains "sorted" [the object attribute corresponding to having a key].) – Frank May 13 '15 at 17:50
  • @Frank Thanks, new info for me. So, the `setkey` part is not neede – akrun May 13 '15 at 17:51
  • 1
    @Frank At fist I thought according to post, that the problem is in loading the data.frame structure. Than, I just wanted to point out, that the data should be usually loaded from the file, because it seemed to me, that you are suggesting to use the "pasting" method. But I see, what did you meant now. – Kamil S Jaron May 13 '15 at 17:52

2 Answers2

4

With help from @akrun and @eddi, here's the idiomatic (?) way:

mycols  = c("description","date","location")
setkeyv(DT0,mycols)
DT1 <- DT0[J(do.call(CJ,lapply(mycols,function(x)unique(get(x)))))]
# alternately: DT1 <- DT0[DT0[,do.call(CJ,lapply(.SD,unique)),.SDcols=mycols]]

The identifier column is missing for the new rows, but can be filled:

setkey(DT1,description)
DT1[unique(DT0[,c("description","identifier")]),identifier:=i.identifier]
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Frank
  • 66,179
  • 8
  • 96
  • 180
  • This is what I came up with `DT1[DT0[,c(3,1,4,2,5,6), with=FALSE], c(paste0('value', 1:2), 'identifier') := list(i.value1, i.value2, i.identifier)][, identifier:= identifier[!is.na(identifier)][1L], description][order(identifier, date, description)]` – akrun May 13 '15 at 17:59
  • Your solution is much better – akrun May 13 '15 at 18:01
  • 1
    I'm a little confused - are you trying to generalize `DT0[CJ(unique(description), unique(date), unique(location))]`, so that you don't have to write out unique's or is there smth else going on that I'm missing? – eddi May 13 '15 at 18:06
  • Thank you so much Frank & @akrun, great answer and lots of stuff for me to investigate to progress my learning! – Jerry Smithwell May 13 '15 at 18:07
  • @eddi However, the missing identifier things still need to be filled in that case. – Frank May 13 '15 at 18:08
  • 1
    @Frank true. There was a proposal some time ago to have `CJ` support data.table's as arguments. Using the function from [this issue](https://github.com/Rdatatable/data.table/pull/814), you can do: `setkey(DT0, date, identifier, description, location); DT0[CJ.dt(unique(date), data.table(unique(identifier), unique(description)), unique(location))]`; if/when [issue 1090](https://github.com/Rdatatable/data.table/issues/1090) gets implemented, regular `CJ` would be able to do it. – eddi May 13 '15 at 18:17
  • @eddi Oh, interesting workaround for the `identifier` problem here; such a `CJ.dt` function would be handy. (I might do `unique(data.table(identifier,description))` since I don't trust two calls of `unique` to maintain the 1-to-1 mapping.) Anything that simplifies the syntax for the Cartesian product of unique values would also be handy, since that's what I'm always using `CJ` for. – Frank May 13 '15 at 18:23
  • That last observation is spot on - I've never seen `CJ` used without `unique`. [FR added](https://github.com/Rdatatable/data.table/issues/1148) – eddi May 13 '15 at 19:01
  • I noticed with the latest update, the identifier column is actually included, but is NA for the rows where value1/value2 are NA. I'm still struggling with a bunch of the concepts presented here, but the second item to add the identifier column doesn't seem to rectify the issue. – Jerry Smithwell May 13 '15 at 21:38
  • 1
    @JerrySmithwell Sorry about that. I've fixed it so it should work again, using `i.identifier`. Why I had to change it: When merging two data.tables `X[Y]` with common column names (outside of those columns being merged), the `Y` columns must be referred to with `i.colname`. – Frank May 13 '15 at 21:46
2

If I understand the question rightly - and using just base R, not any special data.table:

# The fields for whose every permutation we require a row
unique.fields <- c("date", "identifier", "description", "location")
filler <- expand.grid(sapply(unique.fields, function(f) unique(foo[,f])) )
merge(filler, foo, by=unique.fields,  all.x=TRUE)
Reign of Error
  • 265
  • 1
  • 8
  • 1
    Your `expand.grid` gives 96 rows. Because identifier and description are 1-to-1 (baz is always 2, etc.), you don't want to take their cross product. – Frank May 13 '15 at 18:34