I have a very large text file that I would like to split according to these rules:

  • size (2GB per file; the last file may be smaller than 2GB)
  • line boundaries (each file must end at the end of a line)
  • the new files should be text files
  • each new file should begin with the line `this is first line`

For example, given a file like this:

textdata1
textdata2
textdata3
textdata4
textdata5
textdata6

I would like the output to be textfile_1.txt:

this is first line
textdata1
textdata2

and perhaps textfile_2.txt:

this is first line
textdata3
textdata4
textdata5
textdata6

I tried `split -b <size>`, but it splits right in the middle of a line.

Tanvir
  • Use `split -l ` to specify the number of lines per file. It doesn't automatically add the prefix line, you'll need to do that yourself in the script after splitting. – Barmar Jun 20 '23 at 16:46
  • See [this question](https://unix.stackexchange.com/questions/298700/how-to-split-a-6-or-7-gb-file-into-several-sub-2-gb-files-without-splitting-entr) on [unix.se] – Barmar Jun 20 '23 at 16:48
  • @Barmar the split will not be based on line counts. That's the issue here. – Tanvir Jun 20 '23 at 16:58
  • GNU `split` also has a `-C` option that will create a slightly smaller file if the requested number of bytes splits a line. – chepner Jun 20 '23 at 17:20
  • It's not clear that there's a condition you can use to make the uneven split you request here. Why is `textdata3` in the second file instead of the first? – chepner Jun 20 '23 at 17:21
  • @Tanvir so if 2 GB is in the middle of a line, do you want a file bigger than 2GB that ends at the end of the line? Or do you want to stop at the previous line and have the file smaller than 2GB? I am sure a bash script could do what you need but your requirement is ambiguous. – Joseph Ishak Jun 20 '23 at 17:35
  • @Tanvir Did you see the answer in the question I linked to? It might require GNU split, though. – Barmar Jun 20 '23 at 17:54
  • @Barmar that wouldn't add the `this is first line` header that the OP wants added to each output file and `split` wouldn't have a way to take the length of that line into consideration when calculating the overall output file size to ensure it's under 2G. I don't think there's any way to use `split` for the OP's requirements. – Ed Morton Jun 20 '23 at 17:57
  • @EdMorton As I said above, he'd have to add that himself after splitting. And it should be trivial to subtract the length of that line from the maximum chunk size. – Barmar Jun 20 '23 at 18:04
  • @Barmar, ah, I missed that you'd mentioned that. You should post an answer since the solution would be more than just what's in that question you referred to. – Ed Morton Jun 20 '23 at 18:06
  • That part is answered in [how to insert a text at the beginning of a file](https://stackoverflow.com/questions/9533679/how-to-insert-a-text-at-the-beginning-of-a-file) – Barmar Jun 20 '23 at 18:15
  • There's also calculating the length of the header to subtract from the max to pass to `split`, which I'm sure there are also answers to in the forum, but it'd be nice to see a complete solution to the OP's whole problem. No worries either way though. – Ed Morton Jun 20 '23 at 18:20
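
The approach discussed in the comments (reserve room for the header, cut only at line boundaries, then put the header at the top of every piece) can be sketched in a few lines of Python. This is a minimal in-memory illustration rather than a replacement for `split`: the function name `chunk_lines` and its parameters are hypothetical, and it assumes that a line too long for the budget is placed alone in an oversized chunk instead of being cut.

```python
def chunk_lines(lines, max_bytes, header=b"this is first line\n"):
    """Group byte lines into chunks of at most max_bytes, header included.

    Hypothetical helper for illustration. Assumption: a single line that
    cannot fit next to the header still gets its own (oversized) chunk,
    because lines are never cut.
    """
    chunks = []
    current = None      # pieces of the chunk being built, header first
    size = 0            # bytes in the current chunk
    for line in lines:
        if current is None or size + len(line) > max_bytes:
            # Start a new chunk with the header on top.
            current = [header]
            size = len(header)
            chunks.append(current)
        current.append(line)
        size += len(line)
    return [b"".join(chunk) for chunk in chunks]


# The sample data from the question, with a tiny 40-byte limit.
data = [b"textdata%d\n" % i for i in range(1, 7)]
for number, chunk in enumerate(chunk_lines(data, max_bytes=40), start=1):
    print(f"textfile_{number}.txt: {len(chunk)} bytes")
```

With a 40-byte limit, each piece holds the 19-byte header plus two 10-byte lines, so the six sample lines end up in three pieces. A real 2GB run would stream from disk instead of holding the lines in memory.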

2 Answers

If each character is a byte then you could do something like this (untested, using any awk):

awk '
    BEGIN {
        maxLgth = 2 * (1000 ^ 3)     # 2 GB in bytes; use 1024 ^ 3 for 2 GiB
        hdr = "this is first line"
        outLgth = maxLgth + 1        # to ensure "out" gets populated for first line
    }
    {
        lineLgth = length($0) + 1    # +1 for the newline that print adds
        if ( (outLgth + lineLgth) > maxLgth ) {
            close(out)
            out = "out" (++outCnt)
            print hdr > out
            outLgth = length(hdr) + 1    # +1 for the newline that print adds
        }
        print > out
        outLgth += lineLgth
    }
' file
Ed Morton
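
Whether "2GB" means decimal or binary units changes the limit by roughly 7%, and the header line uses up part of each file's budget. A small worked calculation, using the header text from the question and assuming one byte per character and a one-byte newline:

```python
header = "this is first line"

limit_decimal = 2 * 1000**3   # 2 GB  = 2,000,000,000 bytes
limit_binary = 2 * 1024**3    # 2 GiB = 2,147,483,648 bytes

# The header and its trailing newline count toward every output file.
header_bytes = len(header.encode("utf-8")) + 1

print("decimal limit:", limit_decimal)
print("binary limit: ", limit_binary)
print("header bytes: ", header_bytes)
print("room for data (decimal):", limit_decimal - header_bytes)
```

If the files must stay under a hard 2 GiB boundary imposed by some tool or file system, the binary figure is the relevant one; otherwise the decimal figure is the more conservative choice.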

This Python script can help:

def split_file(file_path, size_limit=2 * 1024**3, first_line="this is first line"):
    header = (first_line + "\n").encode("utf-8")
    counter = 0
    written = 0          # bytes written to the current output file
    output_file = None

    # Binary mode keeps the byte counts exact and avoids newline translation.
    with open(file_path, "rb") as f:
        for line in f:
            # Start a new file at the very beginning, or when this line would
            # push the current file past the limit. The size is tracked here
            # because os.path.getsize() misses data still in the write buffer.
            if output_file is None or written + len(line) > size_limit:
                if output_file is not None:
                    output_file.close()
                counter += 1
                output_file = open(f"textfile_{counter}.txt", "wb")
                output_file.write(header)
                written = len(header)
            output_file.write(line)
            written += len(line)

    if output_file is not None:
        output_file.close()

# Call the function
split_file("yourfile.txt")