
I need help splitting a big file (1.6 M records) into multiple files based on a maximum number of lines allowed per sub-file, with the caveat that an order should not spill across files and appear in multiple files.

Quick overview of the file: it holds order information for transactions at a retail store, and each order can have multiple items. Below is a small sample file.

sample_file:

order_nu item_nu Sale
1 1 10
1 2 20
1 3 30
2 1 10
2 2 20
3 1 10
3 2 10
4 1 20
4 2 24
4 3 34
4 4 10
4 5 20
5 1 30
5 2 20
5 3 40

Is it possible to write a Linux script that splits a file based on the number of lines, with the caveat that an order must not spill across files and appear in more than one file? For the file above, I need it split so that each sub-file has no more than 5 records, and no order appears in more than one file (the assumption is that an order will never have more than 5 items). Below is the expected output:

sub_file1:

| order_nu | item_nu | Sale |
| -------- | ------- | ---- |
| 1        | 1       | 10   |
| 1        | 2       | 20   |
| 1        | 3       | 30   |
| 2        | 1       | 10   |
| 2        | 2       | 20   |

sub_file2:

| order_nu | item_nu | Sale |
| -------- | ------- | ---- |
| 3        | 1       | 10   |
| 3        | 2       | 10   |

sub_file3:

| order_nu | item_nu | Sale |
| -------- | ------- | ---- |
| 4        | 1       | 20   |
| 4        | 2       | 24   |
| 4        | 3       | 34   |
| 4        | 4       | 10   |
| 4        | 5       | 20   |

sub_file4:

| order_nu | item_nu | Sale |
| -------- | ------- | ---- |
| 5        | 1       | 30   |
| 5        | 2       | 20   |
| 5        | 3       | 40   |

Please let me know if there are any questions. Thank you!

Maxx c
  • AWK? Seems pretty straightforward if your stated assumptions are correct. `awk 'NR==1{hdr=$0; next;} { if ($1 != last) { close(last); last=$1; f="order_"$1; print hdr > f; } print > f; }' file.in` If you want to get pretty with the field separators and divider lines, `awk 'BEGIN{FS="\t"; OFS="|";} NR==1{hdr=$1"|"$2"|"$3"\n-------|-------|-------"; next;} { if ($1 != last) { close(last); last=$1; f="order_"$1; print hdr > f; } print $1,$2,$3 > f; }' file.in` - though I have no idea what you're doing with those double-pipes, so you'll have to adjust for whatever that is. – Paul Hodges Dec 06 '22 at 19:36

1 Answer


Try something like this

max_lines=5                                # maximum records per sub-file
counter=1
prev_order=""

while read -r line
do
   order=${line%%[[:space:]]*}             # first field = order_nu
   # only switch files at an order boundary, so an order never spills across files;
   # note: a sub-file can run over max_lines by up to one order because of this
   if [ -f "sub_file$counter.txt" ] \
      && [ "$(wc -l < "sub_file$counter.txt")" -ge "$max_lines" ] \
      && [ "$order" != "$prev_order" ]
   then
     counter=$((counter+1))
   fi
   echo "$line" >> "sub_file$counter.txt"
   prev_order=$order
done < sample_file.txt
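
With 1.6 M records, rereading the current sub-file with `wc -l` on every line will be slow. A rough single-pass awk alternative is sketched below (a sketch only, assuming whitespace-separated fields with order_nu in column 1, a header on line 1, output files named sub_fileN.txt, and no order longer than max records). It buffers one order at a time and starts a new sub-file whenever the buffered order would push the current file past the limit, so the 5-record cap is honored strictly:

awk -v max=5 '
BEGIN { file = 1 }
NR == 1 { hdr = $0; next }                 # remember the header line
NR > 2 && $1 != prev { flush_order() }     # order number changed: write the buffered order
{ buf[++n] = $0; prev = $1 }               # buffer the current record
END { flush_order() }                      # write the final order

function flush_order(    i, out) {
    if (cnt > 0 && cnt + n > max) {        # buffered order will not fit in the current sub-file
        close("sub_file" file ".txt")
        file++
        cnt = 0
    }
    out = "sub_file" file ".txt"
    if (cnt == 0) print hdr > out          # every new sub-file gets the header
    for (i = 1; i <= n; i++) print buf[i] > out
    cnt += n
    n = 0
}
' sample_file.txt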
    This is so inefficient as to be criminal in some jurisdictions. Perhaps see also https://stackoverflow.com/questions/65538947/counting-lines-or-enumerating-line-numbers-so-i-can-loop-over-them-why-is-this – tripleee Dec 06 '22 at 19:11