
I want to split rather large files into chunks of at most 5 GB. My goal is to have every partition smaller than 5 GB while creating as few partitions as possible.

While I would normally use split with a size limit, I need to ensure that lines remain intact, which I cannot get splitting by size to do.

I have been contemplating using the file size and line count to determine the number of lines to put in each file.

e.g.

File size = 11Gig
File line count = 900
File limit = 5Gig
ceiling(11/5) = 3
900/3 = 300
# Split the file by lines, 300 lines per output file.

While this would usually work, lines vary in length, so an output file could still end up above 5 GB if a segment of the file contains an extremely large line.
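To make the concern concrete, here is a small sketch of the estimate above with scaled-down numbers. The line lengths, the 50-byte limit, and the helper names `lines_per_chunk` and `chunk_sizes` are made up for illustration; they only stand in for the real 5 GB case.

```python
import math

def lines_per_chunk(file_size, line_count, limit):
    """Estimate lines per output file from total size, as in the example above."""
    chunks = math.ceil(file_size / limit)
    return math.ceil(line_count / chunks)

def chunk_sizes(line_lengths, per_chunk):
    """Byte size of each output file when splitting every `per_chunk` lines."""
    return [sum(line_lengths[i:i + per_chunk])
            for i in range(0, len(line_lengths), per_chunk)]

# Hypothetical line lengths in bytes: short lines followed by two long ones.
line_lengths = [5, 5, 5, 5, 45, 45]   # 110 bytes in total
limit = 50                            # stand-in for the 5 GB limit

per_chunk = lines_per_chunk(sum(line_lengths), len(line_lengths), limit)
sizes = chunk_sizes(line_lengths, per_chunk)
print(per_chunk, sizes)   # prints 2 [10, 10, 90]
```

Every single line fits within the limit, yet the last output file is 90 bytes, because the estimate assumes all lines are about the same length.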

I'm contemplating using Python (it handles numbers much better and seems less hackish), but then I would lose bash's file manipulation speed.

I'm wondering if anyone knows of a better alternative in bash?

Thank you in advance!

JarODirt
  • 900 lines is nothing, you can just loop over it. I'd loop over each line, keep a counter for the bytes written to the current file. If it, with the current line, goes above 5G, start a new file. If the line itself is over 5G, throw an error. (If you have to save memory, this will get somewhat more complicated, but still possible.) – Carsten Aug 20 '15 at 20:42
  • Linux split has a --lines=NUMBER option to split by NUMBER of lines per output file. –  Aug 20 '15 at 20:46
  • The numbers above are not an accurate scale to what I will actually be doing, they are just given to provide an easy to understand example. – JarODirt Aug 21 '15 at 12:40

1 Answer


From the split man-page:

...
-C, --line-bytes=SIZE
put at most SIZE bytes of lines per output file
...

The description of this option may not be very obvious, but it seems to cover what you are asking for: the file is split at the latest possible line break before reaching SIZE bytes.
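If split is not available, or if you would rather get an error than a cut line when a single line is larger than SIZE, the same greedy idea can be sketched in Python. This is only an illustration, not how split itself is implemented: the function name `split_by_lines`, the output naming scheme, and the tiny 10-byte limit in the demo are all assumptions made for the example.

```python
import os
import tempfile

def split_by_lines(path, limit, prefix):
    """Split `path` into files of at most `limit` bytes without cutting lines.

    Raises ValueError if a single line is longer than `limit`.
    Returns the list of output file names.
    """
    outputs = []
    out = None
    written = 0
    try:
        with open(path, "rb") as src:
            for line in src:  # one line at a time, newline included
                if len(line) > limit:
                    raise ValueError("a single line exceeds the size limit")
                # Start a new output file when this line would not fit.
                if out is None or written + len(line) > limit:
                    if out is not None:
                        out.close()
                    name = f"{prefix}{len(outputs):04d}"
                    out = open(name, "wb")
                    outputs.append(name)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return outputs

# Small demonstration with a 10-byte limit instead of 5 GB.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    with open(src_path, "wb") as f:
        f.write(b"aaaa\nbbbb\ncccccccc\ndd\n")
    parts = split_by_lines(src_path, 10, os.path.join(tmp, "part_"))
    part_sizes = [os.path.getsize(p) for p in parts]
    print(part_sizes)   # prints [10, 9, 3]
```

Note that this reads each line fully into memory, so lines close to the limit need a comparable amount of RAM, and it will likely be slower than split on very large inputs.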

yaccob
  • Actually that does look like it will solve my issue. Thanks for pointing that out! – JarODirt Aug 21 '15 at 12:40
  • Unfortunately, this may cut lines, namely when they are larger than `SIZE`. I'd be interested in a solution that keeps lines intact no matter what... – ingomueller.net Sep 05 '20 at 13:31