4

I have a data set that looks like this:

shop_id,item_id,time,value
150,1,2015-07-10,3
150,1,2015-07-11,5
150,1,2015-07-13,2
150,2,2015-07-10,15
150,2,2015-07-12,12

Within each group, defined by "shop_id" and "item_id", there are missing dates.

I wish to expand this irregular time series into a regular one, with consecutive dates, within each group:

shop_id,item_id,time,value
150,1,2015-07-10,3
150,1,2015-07-11,5
150,1,2015-07-12,0 # <~~ added
150,1,2015-07-13,2
150,2,2015-07-10,15
150,2,2015-07-11,0 # <~~ added
150,2,2015-07-12,12

For the dates that are added, the corresponding values should be zero. I've read very similar questions (either using R or SQL coalescing), but most of the solutions I've seen don't involve GROUP BYs.

Basically, I have access to the SQL database, or I can export the data as CSV for manipulation, preferably in C#. I was hoping to find a C# library that can do this kind of data manipulation, but couldn't find any.

Any advice or help is appreciated!

Henrik
Jia Jian Goi
  • So do you want a [tag:R]-based solution, a [tag:SQL]-based solution, or only C#? If you do not want R, then remove the tag, since the question then does not involve R at all. Also, have you tried anything with C# yet? Even without a library (check out the [.NET DataSet](https://stackoverflow.com/questions/16606753/populating-a-dataset-from-a-csv-file)) one can open and parse a file manually. – LinkBerest Jul 19 '15 at 15:08
  • To simulate GROUP BY in R, you could use the package *plyr* for aggregating data. Try the function *daply* ( [plyr manual, page 17](https://cran.r-project.org/web/packages/plyr/plyr.pdf) ); a small sketch of this follows after these comments. – noumenal Jul 19 '15 at 15:22
  • Apologies for not being clear - actually any solutions are fine! – Jia Jian Goi Jul 19 '15 at 16:29
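
To illustrate the *daply* suggestion from the comments above, here is a minimal sketch of group-wise aggregation in R (the data frame name df and the aggregation by sum are only assumptions for the example; it shows the GROUP BY idea, not the date filling itself):

library(plyr)

df <- read.csv(text = "shop_id,item_id,time,value
150,1,2015-07-10,3
150,1,2015-07-11,5
150,1,2015-07-13,2
150,2,2015-07-10,15
150,2,2015-07-12,12")

# total 'value' within each (shop_id, item_id) group, similar to a SQL GROUP BY
daply(df, .(shop_id, item_id), function(d) sum(d$value))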

3 Answers

11

You can use data.table from R. Assuming that the 'time' column is of class 'Date':

library(data.table) # v1.9.5+
# all consecutive daily dates within each shop_id/item_id group
DT1 <- setDT(df1)[, list(time = seq(min(time), max(time), by = 'day')),
                  by = .(shop_id, item_id)]
# key df1 on the first three columns, join it to the full date grid,
# then fill the newly created rows (NA 'value') with 0 and print
setkeyv(df1, names(df1)[1:3])[DT1][is.na(value), value := 0][]
#   shop_id item_id       time value
#1:     150       1 2015-07-10     3
#2:     150       1 2015-07-11     5
#3:     150       1 2015-07-12     0
#4:     150       1 2015-07-13     2
#5:     150       2 2015-07-10    15
#6:     150       2 2015-07-11     0
#7:     150       2 2015-07-12    12
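
If the 'time' column was read in as character or factor rather than Date (as it would be from the CSV at the bottom of this page), it needs converting before running the steps above; a minimal sketch, assuming the data frame is named df1 as above:

df1$time <- as.Date(df1$time)  # convert 'time' to Date class first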

In the devel version, you can also do this without setting the key, by passing the join columns through the on argument. Instructions to install the devel version are here.

 df1[DT1, on = c('shop_id', 'item_id', 'time')][is.na(value), value := 0][]
 #   shop_id item_id       time value
 #1:     150       1 2015-07-10     3
 #2:     150       1 2015-07-11     5
 #3:     150       1 2015-07-12     0
 #4:     150       1 2015-07-13     2
 #5:     150       2 2015-07-10    15
 #6:     150       2 2015-07-11     0
 #7:     150       2 2015-07-12    12

Or, as @Arun suggested, a more efficient option is to create the 'value' column in DT1 pre-filled with 0 and then update only the matching rows by reference from df1, so no separate joined table needs to be materialised:

 DT1[, value := 0L][df1, value := i.value, on = c('shop_id', 'item_id', 'time')]
 DT1 
akrun
3

This is a SQL-based solution.

First you need a dates table

Date table query. Note this will create a physical table in your database.

;with cte as
(
select cast('2000-01-01' as datetime) as Dates -- Start date
union all
select dateadd(DD,1,Dates) -- step one day at a time
from cte 
where Dates < '2099-12-01' -- End date
)
select * 
INTO Date_table 
from CTE
OPTION (MAXRECURSION 0) -- lift the default limit of 100 recursions

Then you need to left outer join your table with Date_table, restricted to each group's own date range, to get the missing dates.

SELECT A.shop_id, 
       A.item_id, 
       DT.dates, 
       Isnull(Y.value, 0) AS value 
FROM   date_table DT 
       CROSS JOIN(SELECT shop_id, 
                         item_id, 
                         Min([time]) AS min_time, -- each group's own date range
                         Max([time]) AS max_time 
                  FROM   yourtable 
                  GROUP  BY shop_id, item_id) A 
       LEFT OUTER JOIN yourtable Y 
                    ON Y.[time] = DT.dates 
                       AND A.shop_id = Y.shop_id 
                       AND A.item_id = Y.item_id 
WHERE  DT.dates BETWEEN A.min_time AND A.max_time 
Pரதீப்
1

Here's a solution with fill_by_value from padr:

library(dplyr)
library(tidyr)
library(padr)

df %>%
  mutate(time = as.Date(time)) %>%
  group_by(item_id) %>%
  pad() %>%              # from padr
  fill(shop_id) %>%      # from tidyr
  fill_by_value(value)   # from padr

Result:

# A tibble: 7 x 4
# Groups:   item_id [2]
  shop_id item_id       time value
    <int>   <int>     <date> <dbl>
1     150       1 2015-07-10     3
2     150       1 2015-07-11     5
3     150       1 2015-07-12     0
4     150       1 2015-07-13     2
5     150       2 2015-07-10    15
6     150       2 2015-07-11     0
7     150       2 2015-07-12    12

Data:

df = read.table(text = "shop_id,item_id,time,value
150,1,2015-07-10,3
150,1,2015-07-11,5
150,1,2015-07-13,2
150,2,2015-07-10,15
150,2,2015-07-12,12", header = TRUE, sep = ",")
acylam