0

I am new to R programming and i am trying to do some work in my office. I have a dataframe as below:

Sale_date    Sale_State  units_sold   Cummulative_unit_sold
30/1/2020    Kerala          1               1
1/2/2020     Kerala          1               2
2/2/2020     Kerala          2               4
3/2/2020     Tamil Nadu      1               1
3/2/2020     Rajasthan       2               2
3/2/2020     Delhi           1               1
4/2/2020     Kerala          1               5
4/2/2020     Rajasthan       1               3
etc.... So the data is in this kind.

I want the output in the below format

Sale_date    Sale_State  units_sold   Cummulative_unit_sold
30/1/2020    Kerala          1               1
1/2/2020     Kerala          1               2
2/2/2020     Kerala          2               4
3/2/2020     Tamil Nadu      1               1
3/2/2020     Rajasthan       2               2
3/2/2020     Delhi           1               1
3/2/2020     Kerala          2               4(Please observe here in rawdata no sales on 3/2/2020 for kerala state so it should copy the same data as previous date data)
4/2/2020     Kerala          1               5
4/2/2020     Rajasthan       1               3
4/2/2020     Tamil Nadu      1               1(Please observe here in rawdata no sales on 4/2/2020 for Tamil Nadu state, so it should copy the same data as previous data)

I am not getting how to do this. This is not directly something replacing NA values, but instead I want to create a dummy row of State for which there is no sales on respective day and add the previous data which is available.

Thanks, Hemanth

  • 1
    There are numerous solutions to this problem on Stack Overflow; see, for example, [this one](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value). – ulfelder Apr 19 '20 at 11:07
  • @ulfelder, As per your suggestion the method which is mentioned is to replace NA values, but in my case I do not have NA values. I have insert a new row of data as per the sale date for states which data is not present. – Hemanth_Joshi Apr 19 '20 at 11:13
  • "I have [to] insert a new row of data as per the sale date for states which data is not present"...and then fill the NA in the new row with the value from the preceding date. So, really, you've got a sequence of two problems to solve. Try solving the first one (adding rows for missing steps) and then circle back to this one. – ulfelder Apr 19 '20 at 11:17
  • On how to solve that first problem, see for example the help for `expand_grid` or `tidyr::expand`. – ulfelder Apr 19 '20 at 11:18
  • @ulfelder, So I have completed the first step by creating a dataframe as: '"temp <-expand_grid(unique(Sale_state),unique(Sale_date))"'. Now how to get the "unit_sold" values of respective state and respective date. – Hemanth_Joshi Apr 19 '20 at 12:04
  • Does this answer your question? [Replacing NAs with latest non-NA value](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) – Len Greski Apr 19 '20 at 14:20
  • @LenGreski, yes this post helps me,but before replacing the NAs I have to join 2 data frames. I am trying to do a left outer join. But i have 2 variables called Sale_state and Sale_date, so I must do a left outer join by considering both variables. Let me know how can I do that. TIA – Hemanth_Joshi Apr 19 '20 at 17:19
  • @ulfelder, How can I do a left outer join by using 2 different columns(i.e Sale_state and Sale_date) ? TIA – Hemanth_Joshi Apr 19 '20 at 17:20

1 Answers1

0

To get the question from the queue. (; Best is to use tidyr::complete (variant of tidyr::expand) and tidyr::fill as already suggested @ulfelder:

library(tidyr)
library(dplyr)

tidyr::complete(df, Sale_date, Sale_State) %>% 
  # Fill by Sale_State
  group_by(Sale_State) %>%
  # Arrange by Date
  arrange(as.Date(Sale_date, "%d/%m/%Y")) %>% 
  # fill with data from previous dates
  tidyr::fill(units_sold, Cummulative_unit_sold) %>% 
  ungroup() %>% 
  # To get rid of obs from previous dates created by tidyr::complete
  dplyr::filter(!is.na(units_sold))

#> # A tibble: 11 x 4
#>    Sale_date Sale_State units_sold Cummulative_unit_sold
#>    <fct>     <fct>           <int>                 <int>
#>  1 3/2/2020  Delhi               1                     1
#>  2 4/2/2020  Delhi               1                     1
#>  3 30/1/2020 Kerala              1                     1
#>  4 1/2/2020  Kerala              1                     2
#>  5 2/2/2020  Kerala              2                     4
#>  6 3/2/2020  Kerala              2                     4
#>  7 4/2/2020  Kerala              1                     5
#>  8 3/2/2020  Rajasthan           2                     2
#>  9 4/2/2020  Rajasthan           1                     3
#> 10 3/2/2020  Tamil Nadu          1                     1
#> 11 4/2/2020  Tamil Nadu          1                     1

Created on 2020-04-19 by the reprex package (v0.3.0)

stefan
  • 90,330
  • 6
  • 25
  • 51
  • Hello Stefan, Thanks for letting me know about complete function. But when I tried your solution which you provided above,I am getting multiple duplicate values. I mean values are getting repeated. How can i get rid of it. I had applied the filter function which you have applied at the end, but only the records which are having unwanted NA's are getting removed. Need your help. TIA – Hemanth_Joshi Apr 19 '20 at 17:17
  • I am getting output something like the below: ````````````````````````````````````` Sale_Date Sale_State Cumm_Sum 1 2020-01-30 Kerala 1 2 2020-02-02 Kerala 2 3 2020-02-03 Kerala 3 4 2020-03-02 Delhi 1 5 2020-03-02 Kerala 3 6 2020-03-02 Telangana 1 7 2020-03-02 Delhi 1 8 2020-03-02 Kerala 3 9 2020-03-02 Telangana 1 10 2020-03-03 Delhi 1 11 2020-03-03 Kerala 3 12 2020-03-03 Rajasthan 1 13 2020-03-03 Telangana 1 ````````````````````````````````````` – Hemanth_Joshi Apr 19 '20 at 17:34
  • @Hemanth_Joshi Hm. Hard to check this out. First. Check whether your raw data contains any duplicates e.g. `sum(duplicated(df[, c("Sale_Date", "Sale_State")])` will give you the number of duplicated obs for State and Date. To view the duplicated entries you may use `df[duplicated(df[, c("Sale_Date", "Sale_State")]),]`. Second. To get rid of duplicates with same values for all columns you can use `dplyr::distinct(df)`. If you want to drop obs with same values for specific columns you may e.g. use `dplyr::distinct(df, Sale_Date, Sale_State, .keep_all = TRUE)`. – stefan Apr 19 '20 at 19:51