0

I've been searching arround SO, and other data science and programming blogs, but i haven't found an answer for my specific need. So, if you find this question duplicate please be kind enough and point me to source of the information and close/delete this question.

My real data will have several thousands of rows, so i display here only a small set of fictional data, that resembles very similarly my original data:

Data <- data.frame(CategoryA =c("Xpto1","Xpto1","Xpto1","Xpto1","Xpto2","Xpto2","Xpto2","Xpto1","Xpto1", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3"),CategoryB = c("Type1","Type1","Type1","Type1","Type1","Type1","Type1","Type2","Type2", "Type1", "Type1", "Type1", "Type1", "Type1", "Type1"),ID = c(1,1,1,1,2,2,2,3,3,4,4,4,4,4,4),Year =c(2014,2015,2016,2017,2007,2009,2010,2014,2016, 1997,2002,2010,2012,2013,2015),Class = c("New","Existing", "Existing", "Lost","New", "Existing", "Existing", "New", "Existing","New", "Lost", "Out","Recovered", "Existing", "Existing"))

I would post the data frame but i don't know how to do it properly in SO, althoug i'v read the proper thread about reproducible example: How to make a great R reproducible example?

My best attemp so far is as follows:

Data %>% 
group_by(CategoryA, CategoryB, ID) %>% 
complete(nesting(CategoryA, CategoryB, ID), Year = seq.int( min(Year), max(Year) ) ) %>%
arrange( ID, Year )

However, this aproach has 2 problems:

1 - it only adds rows to the missing years between min and max year of each group;

2 - when this aproach is used in the real data, since the dataset contains about 200K observations, the process is real slow (R takes about 2 or 3 minutes to complete the task at hand).

The explanation of the problem at hand, and rules of row assignation:

Given a list of item ID's, the Year, the Category A and the Category B variables, add rows in the data set in a way such as:

1 - given the max and min year for each group, insert row in missing year and assign the value "Existing" to the Class variable in that missing year (i'm assuming that the solution will automatcally fill Category A and Category B and ID values for each given group that has missing years);

2 - if in the max year of the given group the item is classified as "Existing" and the max year of that same group if inferior of todays date year, then add as many rows as necessary till todays date year and classify the item as "Existing".

Please note: i'm defining a group as the combination of variables: CategoryA, categoryB and ID

Please let me know if you found my rules explanation confusing, so i can have a chance to clarify them for you.

Thank you in advance, for any help you could provide.

EDIT: I updated the data set to better reflect the real Data.

Cheers! :)

zStrike
  • 57
  • 6

1 Answers1

3

A possible solution using :

library(data.table)
setDT(Data)[, .SD[CJ(Year = seq(min(Year), ifelse(Class[which.max(Year)] == "Existing",
                                                  year(Sys.Date()), max(Year))))
                  , on = .(Year)]
            , by = .(ID, CategoryA, CategoryB)
            ][is.na(Class), Class := "Existing"][]

which gives:

    ID CategoryA CategoryB Year    Class
 1:  1     Xpto1     Type1 2014      New
 2:  1     Xpto1     Type1 2015 Existing
 3:  1     Xpto1     Type1 2016 Existing
 4:  1     Xpto1     Type1 2017     Lost
 5:  2     Xpto2     Type1 2007      New
 6:  2     Xpto2     Type1 2008 Existing
 7:  2     Xpto2     Type1 2009 Existing
 8:  2     Xpto2     Type1 2010 Existing
 9:  2     Xpto2     Type1 2011 Existing
10:  2     Xpto2     Type1 2012 Existing
11:  2     Xpto2     Type1 2013 Existing
12:  2     Xpto2     Type1 2014 Existing
13:  2     Xpto2     Type1 2015 Existing
14:  2     Xpto2     Type1 2016 Existing
15:  2     Xpto2     Type1 2017 Existing
16:  2     Xpto2     Type1 2018 Existing
17:  3     Xpto1     Type2 2014      New
18:  3     Xpto1     Type2 2015 Existing
19:  3     Xpto1     Type2 2016 Existing
20:  3     Xpto1     Type2 2017 Existing
21:  3     Xpto1     Type2 2018 Existing
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Your code is working good. However, i failed to supply a proper data set that would reflect more precisely my real data. I edited my question. Would you give another go, working your code in order to adapt to the new scenario, please? Thank you ! – zStrike Jul 07 '18 at 16:11
  • @zStrike As far as I can see, the code is doing what you want. If not, could you describe what is wrong with the output? It would be very helpful if you included the desired output. – Jaap Jul 07 '18 at 17:01
  • Thanks for taking the time to take another look on this problem and sorry i didn't answer earlier, but with the weekend going on family is arround :) . Anyway, if you use your code on the new dataset i edited yesterday, you will notice that for the item with ID 4, beetween Year 2002 and Year 2010 the item got lost in Year 2002 and out in year 2010, this means the item is not existing between those years and so the code should not add years between year 2002 and year 20012 (the year that the item got recovered), i should have posted that on the rules requirements i'm sorry about that. – zStrike Jul 08 '18 at 14:12