There are three datasets: df2017, df2016, df2018.
# 2017 dataset, name: df2017
   PoNo.  Price  R.No         I.No         RI.No                    Date
1  107    300    100;244      464;355      100-464;244-355          2017-02-23;2017-09-03
2  109    100    334          161          334-161                  2017-09-23
3  124    504    124          115          124-115                  2017-11-03
4  207    233    484;450;605  215;444;415  484-215;450-444;605-415  2016-02-03;2016-05-23;2017-04-10
5  223    4667   484;450;605  225;544;615  484-225;450-544;605-615  2017-02-03;2017-05-23;2018-04-03
6  340    1201   448;555;989  212;223;499  448-212;555-223;989-499  2017-01-13;2017-05-24;2017-12-03
7  345-1  2344   471;566      568;394      471-568;566-394          2017-04-03;2017-05-02
8  345-2  0      ""           ""           ""                       ""
Desired shape: split the values in the columns $R.No, $I.No, $RI.No and $Date on ";", give each piece its own row, and duplicate the remaining columns. Recalculate the $Price column by dividing it by the number of dates in the cell, i.e. price / (1 + number of ";").
# 2017 dataset -- after split
   PoNo.  Price     R.No  I.No  RI.No    Date
1  107    150       100   464   100-464  2017-02-23
1  107    150       244   355   244-355  2017-09-03
2  109    100       334   161   334-161  2017-09-23
3  124    504       124   115   124-115  2017-11-03
4  207    77.66667  484   215   484-215  2016-02-03
4  207    77.66667  450   444   450-444  2016-05-23
4  207    77.66667  605   415   605-415  2017-04-10
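The split-and-duplicate step above can be sketched in pandas. This is an assumed approach, not code from the question: split each multi-valued column into lists with `Series.str.split`, divide `Price` by the list length of `Date`, then `DataFrame.explode` over all four columns at once (exploding a list of columns requires pandas >= 1.3):

```python
import pandas as pd

# Toy version of the first two rows of df2017.
df2017 = pd.DataFrame({
    "PoNo.": ["107", "109"],
    "Price": [300, 100],
    "R.No": ["100;244", "334"],
    "I.No": ["464;355", "161"],
    "RI.No": ["100-464;244-355", "334-161"],
    "Date": ["2017-02-23;2017-09-03", "2017-09-23"],
})

split_cols = ["R.No", "I.No", "RI.No", "Date"]

# Turn each ";"-delimited cell into a list; the pieces of the four
# columns stay aligned position by position.
for c in split_cols:
    df2017[c] = df2017[c].str.split(";")

# Price / (1 + number of ";") == Price / number of dates in the cell.
df2017["Price"] = df2017["Price"] / df2017["Date"].str.len()

# Explode all four list columns together, one new row per piece.
out = df2017.explode(split_cols, ignore_index=True)
```

This is vectorised end to end, so it avoids any Python-level loop over the million rows; only the four split columns are touched, the other ~96 columns are duplicated by `explode` automatically.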
If a value in Date falls in a different year, move the whole (split) row to the dataset for that year.
# 2016 dataset, name: df2016
   PoNo.  Price     R.No  I.No  RI.No    Date
1  147    1133      120   414   120-414  2016-01-03
2  229    1512      354   395   354-395  2016-07-05
3  117    130       384   261   384-261  2016-10-13
4  207    77.66667  484   215   484-215  2016-02-03
4  207    77.66667  450   444   450-444  2016-05-23
There is also a 2018 dataset (df2018) that gets the same operation (and, of course, so does df2016).
- Each actual dataset contains over 1 million rows and 100 columns, and the number of ";" occurrences in a cell is not known in advance. How can I do this efficiently, given the size?