1

There is 3 dataset :df2017 ,df2016 , df2018 .

    # 2017 Dataset name :df2017
    PoNo.  Price  R.No        I.No          RI.No                   Date
1   107    300    100;244     464;355       100-464;244-355         2017-02-23;2017-09-03
2   109    100    334         161           334-161                 2017-09-23
3   124    504    124         115           124-115                 2017-11-03
4   207    233    484;450;605 215;444;415   484-215;450-444;605-415 2016-02-03;2016-05-23;2017-04-10
5   223   4667    484;450;605 225;544;615   484-225;450-544;605-615 2017-02-03;2017-05-23;2018-04-03
6   340   1201    448;555;989 212;223;499   448-212;555-223;989-499 2017-01-13;2017-05-24;2017-12-03
7   345-1 2344    471;566     568;394       471-568;566-394         2017-04-03;2017-05-02
8   345-2  0       ""          ""            ""                     ""                               

Desired Shape : Split the line and allocate to the dataset which belongs to .

  • Split the valuables by " ; " of the col : $R.No,$I.No ,$RI.No ,$Date and duplicate the others columns .

  • Caculate col : $Price , divide by how many times of Date show in one cell . like --: price / 1+(n(;)).

      # 2017 Dateset --after split
          PoNo.   Price  R.No    I.No    RI.No     Date
    
       1   107    150    100     464    100-464   2017-02-23
       1   109    150    244     355    244-355   2017-09-03
       2   107    100    334     161    334-161   2017-09-23 
       3   124    504    124     115    124-115   2017-11-03
       4   207    77.66667 484   215    484-215   2016-02-03
       4   207    77.66667 450   444    450-444   2016-05-23
       4   207    77.66667 605   415    605-415   2017-04-10
    
  • If the valuables in Date show different year , move the whole row to the dataset which is the same year.

          # 2016 dataset name :df2016 
              PoNo.   Price  R.No    I.No    RI.No     Date
    
           1   147    1133    120     414    120-414   2016-01-03
           2   229    1512    354     395    354-395   2016-07-05
           3   117    130     384     261    384-261   2016-10-13 
           4   207    77.66667 484   215    484-215   2016-02-03
           4   207    77.66667 450   444    450-444   2016-05-23
    
  • There is 2018 dataset with same operation as well (of course including 2016).

  • Each actual dataset contains over 1 milion rows and 100 cols , how many times ";" appear in the valuables is unknown . How can i achieve it with effective operation ?(as they are too big)
Napbien Cole
  • 25
  • 1
  • 4

1 Answers1

1

Here is a solution with tidyverse packages.

df <- structure(list(PoNo. = c("107", "109", "124", "207", "223", "340", 
"345-1", "345-2"), Price = c("300", "100", "504", "233", "4667", 
"1201", "2344", "0"), R.No = c("100;244", "334", "124", "484;450;605", 
"484;450;605", "448;555;989", "471;566", NA), I.No = c("464;355", 
"161", "115", "215;444;415", "225;544;615", "212;223;499", "568;394", 
NA), RI.No = c("100-464;244-355", "334-161", "124-115", "484-215;450-444;605-415", 
"484-225;450-544;605-615", "448-212;555-223;989-499", "471-568;566-394", 
NA), Date = c("2017-02-23;2017-09-03", "2017-09-23", "2017-11-03", 
"2016-02-03;2016-05-23;2017-04-10", "2017-02-03;2017-05-23;2018-04-03", 
"2017-01-13;2017-05-24;2017-12-03", "2017-04-03;2017-05-02", 
NA)), .Names = c("PoNo.", "Price", "R.No", "I.No", "RI.No", "Date"
), problems = structure(list(row = 8L, col = NA_character_, expected = "7 columns", 
    actual = "6 columns", file = "'data.tsv'"), .Names = c("row", 
"col", "expected", "actual", "file"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame")), row.names = c(NA, -8L), spec = structure(list(
    cols = structure(list(PoNo. = structure(list(), class = c("collector_character", 
    "collector")), Price = structure(list(), class = c("collector_integer", 
    "collector")), R.No = structure(list(), class = c("collector_character", 
    "collector")), I.No = structure(list(), class = c("collector_character", 
    "collector")), RI.No = structure(list(), class = c("collector_character", 
    "collector")), Date = structure(list(), class = c("collector_character", 
    "collector")), X7 = structure(list(), class = c("collector_character", 
    "collector"))), .Names = c("PoNo.", "Price", "R.No", "I.No", 
    "RI.No", "Date", "X7")), default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"), class = c("tbl_df", 
"tbl", "data.frame"))

Here is the part where we computed the Price by dividing it by the number of dates (number of ; + 1). Then we applied dalloliogm's solution to each column with entries separated by semi-colons using dplyr::mutate_at().

df2017 <- df %>% 
  mutate(Price = as.numeric(Price) / (str_count(Date, ";")+1)) %>%
  mutate_at(vars(R.No:Date), function(x) strsplit(as.character(x), ";")) %>% 
  unnest()

You can now filter the entries by year and bind those entries to their corresponding data set.

df2018 <- df2017 %>%
  filter(year(Date) == 2018)

df2016 <- df2017 %>%
  filter(year(Date) == 2016)

df2017 <- df2017 %>%
  filter(year(Date) == 2017)

df2017

##  A tibble: 12 x 6
#    PoNo.  Price R.No  I.No  RI.No   Date      
#    <chr>  <dbl> <chr> <chr> <chr>   <chr>     
#  1 107    150.  100   464   100-464 2017-02-23
#  2 107    150.  244   355   244-355 2017-09-03
#  3 109    100.  334   161   334-161 2017-09-23
#  4 124    504.  124   115   124-115 2017-11-03
#  5 207     77.7 605   415   605-415 2017-04-10
#  6 223   1556.  484   225   484-225 2017-02-03
#  7 223   1556.  450   544   450-544 2017-05-23
#  8 340    400.  448   212   448-212 2017-01-13
#  9 340    400.  555   223   555-223 2017-05-24
# 10 340    400.  989   499   989-499 2017-12-03
# 11 345-1 1172.  471   568   471-568 2017-04-03
# 12 345-1 1172.  566   394   566-394 2017-05-02
hpesoj626
  • 3,529
  • 1
  • 17
  • 25
  • Perfect . very lean operation that benefit the beginner like me a lot . Add bind_rows() to group back all of them can complete what i need . thankyou very much. – Napbien Cole Mar 10 '18 at 15:31