
I want to split a 118 GB file of genomic data with 800,000 columns and 40,000 rows into a series of files with 100 columns each.

I am currently running 15 parallel instances of the following bash script:

infile="$1"
start=$2
end=$3
step=$(($4-1))

for((curr=$start, start=$start, end=$end; curr+step <= end; curr+=step+1)); do
  cut -f$curr-$((curr+step)) "$infile" > "${infile}.$curr" -d' '
done
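
(For illustration only: a launcher along the following lines could start the 15 jobs, assuming the script above is saved as split_cols.sh and the columns are divided evenly; the names and exact ranges here are placeholders, not necessarily what is actually used.)

# Hypothetical launcher: divide the 800,000 columns across 15 background jobs,
# each running the script above (assumed saved as split_cols.sh) on its own range.
total=800000
jobs=15
width=100
per_job=$(( (total / (jobs * width) + 1) * width ))   # columns per job, rounded up to a multiple of 100

for ((s = 1; s <= total; s += per_job)); do
  e=$(( s + per_job - 1 ))
  (( e > total )) && e=$total
  ./split_cols.sh bigfile.txt "$s" "$e" "$width" &
done
wait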

However, judging by the current progress of the script, it will take 300 days to complete the split?!

Is there a more efficient way to split a space-delimited file column-wise into smaller chunks?

Parsa
  • You can try to run the cut command several times in the background for each iteration of the loop, or maybe you can think of some other way to run it in parallel. – Jacek Trociński Dec 06 '16 at 14:43
  • Possible duplicate of [How to split a large text file into smaller files with equal number of lines?](http://stackoverflow.com/questions/2016894/how-to-split-a-large-text-file-into-smaller-files-with-equal-number-of-lines) – Aserre Dec 06 '16 at 14:45
  • @Aserre: I think they are not exact duplicates, because this one asks to split by columns, whereas the proposed duplicate splits the file by lines – user000001 Dec 06 '16 at 14:48
  • I think to speed this up, you'll want to do this in one pass. I would probably try using awk or perl to first parse and cache the separate chunks, then at the end write out the chunks to files (a rough sketch of this idea follows these comments). – flu Dec 06 '16 at 15:20
  • please edit Q to include overall size of file. MBs? GBs? ALSO, for general knowledge sake, what sort of data comes with 800,000 columns?! Good luck. – shellter Dec 06 '16 at 15:24
  • This is probably going to be slow no matter what. Your current approach only uses 0.0125% of each file that it reads, resulting in a tremendous amount of redundant reading. The other approach reads the file once, but keeps 8000 files open for writing each partial line to. This is probably infeasible, although you can probably increase the number of open files your operating system allows at one time. (The alternative, opening and closing files as you write to them, is probably prohibitively slow.) This is also not something you want to do in shell. – chepner Dec 06 '16 at 15:25
  • @shellter 118GB, the data is genomic (see SNPs). – Parsa Dec 06 '16 at 15:32
  • @par : Well this is some heavy lifting! Please keep us updated on timings for any code that you try using. Good luck. – shellter Dec 06 '16 at 15:34
  • @shellter and this is the *downsized* dataset... could have up to 30 million cols and 150 thousand rows. – Parsa Dec 06 '16 at 15:37
  • @flu that would result in writing a large part of the file to RAM? – Parsa Dec 06 '16 at 15:40
  • @par Yes, if you have enough memory that's the approach I would try. If you do not, then you could write out each line at a time (user000001 did this below), but I believe this would be slower. Sorry, I'm used to working with machines that have a lot of memory. Worst case, you could first split by lines, then split by columns, then join them back. – flu Dec 06 '16 at 15:44
  • Please also include output from `ulimit -a`. If you have a low number of 'nofiles', getting the optimal solution will require some further work. Good luck. – shellter Dec 06 '16 at 16:01
  • bash is really slow. You might think about importing your data into a proper database, and using SQL. – glenn jackman Dec 06 '16 at 16:39
  • @chepner Opening and closing as you go is probably still a lot faster than reprocessing the huge input file 8,000 times. It should not be too hard to update the existing answer to do that, if required. – tripleee Dec 07 '16 at 04:51
  • @shellter not sure what you mean by 'no files'. I am running the script on a cluster so I would have to log into the node interactively to check this. – Parsa Dec 07 '16 at 11:20
  • @glennjackman could do this, but then I would have to re-export into CSV to use my statistical packages on the data. – Parsa Dec 07 '16 at 11:21
  • Probably take less than 300 days though – glenn jackman Dec 07 '16 at 14:40
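
Regarding the cache-then-write idea from the comments above, here is a minimal awk sketch, assuming the machine has enough free RAM to hold the entire 118 GB file; the file name and cols value are illustrative:

# Sketch of the cache-then-write idea: accumulate each 100-column chunk in
# memory and flush everything at the end. Needs enough RAM for the whole file.
awk -v cols=100 '
{
  for (i = 1; i <= NF; i++) {
    f = int((i - 1) / cols) + 1                          # output file this column belongs to
    buf[f] = buf[f] $i ((i % cols && i < NF) ? OFS : ORS)
  }
}
END {
  for (f in buf) {
    out = FILENAME "." f                                 # e.g. largefile.1, largefile.2, ...
    printf "%s", buf[f] > out
    close(out)                                           # stay under the open-file limit
  }
}' largefile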

1 Answer


Try this awk script:

awk -v cols=100 '{
  f = 1
  for (i = 1; i <= NF; i++) {
    # OFS between columns of the same chunk, ORS at a chunk boundary or at end of line
    printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
    # the next column belongs to output file int(i/cols)+1
    f = int(i / cols) + 1
  }
}' largefile

I expect it to be faster than the shell script in the question.
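
As a quick sanity check before committing to the 118 GB run, the same command can be tried on a tiny made-up sample (the file name and sizes below are purely illustrative):

# Generate a toy file: 5 rows x 12 space-separated columns
awk 'BEGIN { for (r = 1; r <= 5; r++) for (c = 1; c <= 12; c++) printf "r%dc%d%s", r, c, (c < 12 ? " " : "\n") }' > sample.txt

# Split into 4-column chunks; expect sample.txt.1, sample.txt.2, sample.txt.3
awk -v cols=4 '{
  f = 1
  for (i = 1; i <= NF; i++) {
    printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
    f = int(i / cols) + 1
  }
}' sample.txt

head sample.txt.1 sample.txt.2 sample.txt.3

Each output chunk should contain every row, with that chunk's columns separated by single spaces.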

user000001
  • @123 On my system `/proc/sys/fs/file-max` has a value of 1216186 which is enough to handle the files needed to split 800000 columns into blocks of 100 – user000001 Dec 06 '16 at 15:35
  • @par: I don't think that there is, because each thread will need the entire file in order to extract the columns it needs... – user000001 Dec 06 '16 at 15:58
  • I believe multiple processes can read from the same file at the same time (at least in Unix/Linux, Windows I know can be a problem). Certainly, at some point the OS will get cranky. But even doing 2 scans at the same time (might) reduce the run time by half. @par, hopefully you see that you need a way to produce chunked sets of files from the source (but of course that may be beyond your control). Good luck to all! – shellter Dec 06 '16 at 16:04
  • I think `/proc/sys/fs/file-max` is the _system-wide_ limit, whereas the limit _per process_ is reported by `ulimit -n`, which is typically lower than the 8000 open files needed in this case (without `close` calls). If you have a recent enough Linux distro (e.g., Ubuntu 16.04), you may be able to increase the limit to 8000 (Ubuntu 14.04 is capped at 4096, for instance). – mklement0 Dec 06 '16 at 16:15
  • Another concern is the max. line (record) length that a given Awk implementation can handle. Based on the overall file size reported by the OP, the input lines are about 3MB(!) each in size. It would be interesting to see which of the 3 major Awk implementations (GNU Awk, Mawk, BSD Awk) can handle that. – mklement0 Dec 06 '16 at 16:18
  • @mklement0: I tested it by splitting a file containing 800000 columns into 8000 files and it seemed to work... `ulimit -n` though returns only 1024. My awk version is `GNU Awk 4.1.3`. Let's see if it works in OP's setup... – user000001 Dec 06 '16 at 16:26
  • perhaps one approach to getting more processors involved would be to split it up by repeated halving: 0) the 800k-col file 1) split into 400k-col files 2) split into 200k-col files 3) split into 100k-col files ... when there are enough processors: n) each splits its chunk into 100-col files – tomc Dec 07 '16 at 03:39
  • though I have to point out, at some point, most likely before you are out of cores, you will be I/O bound, not CPU bound – tomc Dec 07 '16 at 03:48
  • This script has been running for 17 hours and seems to be doing the job in a reasonable amount of time. The output files currently have about 17,000 rows out of 40,000. – Parsa Dec 07 '16 at 11:09
  • @tomc this approach did cross my mind. Perhaps a better approach to multithreading would be to split the work row-wise? I thought the work required to implement it outweighs the benefit of multithreading. – Parsa Dec 07 '16 at 11:18
  • @user000001 the FINAL file in the split seems to have concatenated all the rows without linebreaks? In terms of bytes it is equal to the others, but just contains one massive row. Any ideas on what could have caused this? – Parsa Dec 07 '16 at 16:49
  • Never mind, figured it out. It's because the number of cols isn't an exact multiple of 500, so for the final file (where there are only 16 cols) it always inserts a SPACE instead of a NEWLINE between the rows. Will fix this once the script is done by looping through the file and inserting newlines. – Parsa Dec 07 '16 at 17:07
  • @par splitting the work row-wise brings all the processors into contention by writing to the same files. Column-wise, each file written is the responsibility of one core. Just a consideration – tomc Dec 07 '16 at 18:09
  • @par: I fixed the problem with the final file, but discovered another one: the last field was missing in the last file (before the edit). You'll need to run it again for the last file – user000001 Dec 07 '16 at 19:52
  • @tomc: The best solution would probably be to first split the file by rows into a number of pieces equal to the number of cores, then run the above script for each piece, and finally concatenate the pieces to create the desired files. I'll leave the implementation as an exercise to the reader (a rough sketch follows these comments)... – user000001 Dec 07 '16 at 19:55
  • @user000001 thanks for the fix. The last file contained 484 columns. So this means I would rerun the script setting i to NF-484? – Parsa Dec 08 '16 at 10:06
  • @par yes i think so. Note though that one column was missing – user000001 Dec 08 '16 at 10:08
  • @user000001 yep, also apparently it should be NF-483 because the final column will have an i value of NF, so with NF-484 you end up with the last 485 cols. – Parsa Dec 08 '16 at 10:28
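
A rough sketch of the row-split approach suggested in the comments above, assuming 15 cores and the 40,000-row, 800,000-column file from the question (the piece names, line counts, and file names are illustrative):

# 1) Split the big file into ~15 row-wise pieces: piece_aa, piece_ab, ...
split -l 2667 largefile piece_          # 40,000 rows / 15 cores, rounded up

# 2) Run the answer's awk script on each piece in parallel.
#    Note: each awk process keeps ~8,000 output files open (check `ulimit -n`).
for p in piece_*; do
  awk -v cols=100 '{
    f = 1
    for (i = 1; i <= NF; i++) {
      printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
      f = int(i / cols) + 1
    }
  }' "$p" &
done
wait

# 3) Reassemble: for each chunk index, concatenate the pieces in row order.
for f in $(seq 1 8000); do
  cat piece_*."$f" > largefile."$f"
done

Splitting by rows first keeps each awk process writing to its own set of files, so there is no write contention, at the cost of one extra concatenation pass.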