I need help splitting a big file (1.6 M records) into multiple files based on the maximum number of lines allowed per the sub files, with the caveat that an order should not spill across files and appear in multiple files.
Quick overview about the file: The file has order information about transaction at a retail store. Each order can have multiple items. Below is a small example of a sample file.
sample_file:
order_nu | item_nu | Sale |
---|---|---|
1 | 1 | 10 |
1 | 2 | 20 |
1 | 3 | 30 |
2 | 1 | 10 |
2 | 2 | 20 |
3 | 1 | 10 |
3 | 2 | 10 |
4 | 1 | 20 |
4 | 2 | 24 |
4 | 3 | 34 |
4 | 4 | 10 |
4 | 5 | 20 |
5 | 1 | 30 |
5 | 2 | 20 |
5 | 3 | 40 |
Is it possible to write a Linux script that can help me split a file based on the number of lines with the caveat that an order should not spill across files and appear in multiple files. For example for the above file, I need it be split with the condition that the individual sub_files should not have more than by 5 records per file, and an order should not appear in more than one file (assumption is an order will not have more than 5 items). Below is the expected output:
sub_file1 : | order_nu | item_nu | Sale | | -------- | --------|-------| | 1 | 1 | 10 | | 1 | 2 | 20 | | 1 | 3 | 30 | | 2 | 1 | 10 | | 2 | 2 | 20 |
sub_file2: | order_nu | item_nu | Sale | | -------- | --------|-------| | 3 | 1 | 10 | | 3 | 2 | 10 |
sub_file3: | order_nu | item_nu | Sale | | -------- | --------|-------| | 4 | 1 | 20 | | 4 | 2 | 24 | | 4 | 3 | 34 | | 4 | 4 | 10 | | 4 | 5 | 20 |
sub_file4: | order_nu | item_nu | Sale | | -------- | --------|-------| | 5 | 1 | 30 | | 5 | 2 | 20 | | 5 | 3 | 40 |
Please let me know if there are any questions Thank you!