
I have a massive, pipe-delimited .txt file (300 GB) that I'm trying to split into 1 GB files for further analysis in Python. My PC does not have enough space for another 300 GB, though, so I would like to delete chunks of the original file as I split it. The file also has a header that I would like to keep in all the split files.

I have tried splitting it in Bash, but cannot figure out a way to do this while also deleting the original file as I go. The file is too big to load into Python in full.

Edit: I want to do something like this, but with a header:

https://unix.stackexchange.com/questions/628747/split-large-file-into-chunks-and-delete-original

  • A 300 GB file/data/input is not something you process with the shell... – Jetchisel May 27 '23 at 23:46
  • Please edit your question (not in a comment): What have you searched for, and what did you find? What have you tried, and how did it fail? – Cyrus May 27 '23 at 23:49
  • How much free space do you have? Do you have an external drive (USB, enclosure, etc.) you could use (either move the 300 GB, or write the new files to the external drive)? – markp-fuso May 27 '23 at 23:53
  • What is a "pipe-delimited text file" ? Do you mean you want to split the file such that every split falls on a "pipe"? Or do you mean you have a CSV file with a header where the delimiter is "pipe" instead of comma? If the latter, can there be quoted newlines or "pipes" inside fields? – jhnc May 27 '23 at 23:54
  • This might help: [Truncating the first 100MB of a file in linux](https://stackoverflow.com/q/18072180/3776858) – Cyrus May 27 '23 at 23:59
  • Copy the *last* gig of the file to a new one (adjusting the starting point to the beginning of a record as needed), then truncate the original to remove that copied part. Repeat until the original is empty. I wouldn't use shell for the job. – Shawn May 28 '23 at 00:44 (a Python sketch of this approach follows these comments)
  • As @Shawn pointed out, you'll need to `truncate` the file, which can't be done with standard shell features. Now, I somehow understand why you'd want to split your data file: almost all Python code loads the whole data into RAM, even when it's not required; with a 300 GB file, that kind of program is bound to break... So, why don't you read your data line by line, build an array of the target size, analyse it, and start anew with the next chunk of data? – Fravadona May 28 '23 at 08:27 (a sketch of this approach also follows these comments)
  • See [remove-first-n-lines-of-a-file-in-place-in-unix-command-line](https://stackoverflow.com/questions/17330188/remove-first-n-lines-of-a-file-in-place-in-unix-command-line/17331179#17331179) – Ed Morton May 28 '23 at 13:26
  • This is easily done with four passes over the data. The only complicated part of the problem is computing the offsets for the splits but your question provides insufficient detail to advise on how best to do that. – jhnc May 28 '23 at 14:31
  • Instead of splitting into 300 files and calling Python 300 times, can you just get the next GB, run Python, delete the temp 1 GB and create a new one with a changed offset? – Walter A May 28 '23 at 15:39
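
A rough Python sketch of the approach Shawn describes above (the file name bigfile.txt, the ~1 GiB chunk size, and the newfile.N naming are illustrative assumptions, not from the question): copy roughly the last gigabyte to a new file, aligned to a line boundary, prepend the header, then truncate the original by the amount copied.

import os
import shutil

SRC = "bigfile.txt"              # assumed name of the 300 GB source file
CHUNK = 1024 ** 3                # target size per piece: ~1 GiB
                                 # assumes individual lines are far smaller than CHUNK

with open(SRC, "rb") as f:
    header = f.readline()        # header line, repeated in every piece

part = 1
while os.path.getsize(SRC) > len(header) + CHUNK:
    size = os.path.getsize(SRC)
    with open(SRC, "rb") as f:
        f.seek(size - CHUNK)                     # jump to ~1 GiB before the end
        f.readline()                             # discard the (probably partial) line found there
        start = f.tell()                         # offset of the first complete line of this piece
        with open(f"newfile.{part}", "wb") as out:
            out.write(header)                    # every piece starts with the header
            shutil.copyfileobj(f, out)           # stream the rest of the file into the piece
    with open(SRC, "r+b") as f:
        f.truncate(start)                        # chop the copied bytes off the original
    part += 1
# whatever remains in SRC is the last piece and already begins with the header

Each pass copies at most about one chunk and then shrinks the original, so the peak extra disk usage stays around 1 GB.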
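
And a sketch of Fravadona's suggestion: skip the splitting entirely and analyse the file in fixed-size batches of lines. The batch size, file name, and the analyse() placeholder are assumptions to be replaced with the OP's real code.

import csv

SRC = "bigfile.txt"       # assumed file name
BATCH = 1_000_000         # assumed lines per batch; tune so one batch fits in RAM

def analyse(rows):        # placeholder for the OP's actual analysis
    print(f"processed {len(rows)} rows")

with open(SRC, newline="") as f:
    reader = csv.reader(f, delimiter="|")    # pipe-delimited fields
    header = next(reader)                    # read the header once
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH:
            analyse(batch)
            batch = []                       # release the batch before reading the next one
    if batch:
        analyse(batch)                       # final, possibly smaller batch

This keeps only one batch in memory at a time and needs no extra disk space at all.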

1 Answer


Assumptions:

  • data fields do not include embedded linefeeds; otherwise the head and/or tail commands could (erroneously) split data lines

Expanding on an answer to the unix.stackexchange.com question linked by the OP:

numfiles=100                                       # OP determines beforehand how many files to create
numlines=100000                                    # OP determines beforehand how many lines to move to each new file

head -n 1 bigfile > header                         # make a copy of the header line
hdrsize=$(wc -c < header)                          # header size in bytes; excluded from the truncation below

for ((i=numfiles; i>1; i--))
do
    newf=newfile.$i
    cp header "${newf}"                            # start the new file with the header
    tail -n ${numlines} bigfile >> "${newf}"       # append the last ${numlines} data lines of bigfile
    truncate -s -$(( $(wc -c < "${newf}") - hdrsize )) bigfile   # shrink bigfile by only the data bytes copied
done

mv bigfile newfile.1                               # rename what's left of the original file (it still starts with the header)

NOTE: requires truncate (part of GNU coreutils, e.g., sudo apt-get install coreutils)
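
The numfiles/numlines values are left for the OP to choose; one rough way to derive them for ~1 GB pieces is to sample an average line length (a Python sketch; the file name bigfile and the 100,000-line sample size are assumptions):

import os

SRC = "bigfile"                  # same file name as in the script above
TARGET = 1024 ** 3               # ~1 GiB per split file
SAMPLE = 100_000                 # number of data lines to sample

with open(SRC, "rb") as f:
    header = f.readline()
    sample_bytes = sum(len(f.readline()) for _ in range(SAMPLE))

avg_line = max(sample_bytes // SAMPLE, 1)                      # average bytes per data line
numlines = TARGET // avg_line                                  # data lines per ~1 GiB piece
data_bytes = os.path.getsize(SRC) - len(header)
numfiles = -(-data_bytes // (numlines * avg_line))             # ceiling division
print(f"numlines={numlines} numfiles={numfiles}")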

Performance:

  • bigfile : 10 million lines, 810 MBytes
  • 10 seconds: cygwin running in Win10 virtual machine (Ubuntu host, NVME Gen4 PCIe drive)
  • 2 seconds: running directly on the same Ubuntu host
markp-fuso