I'm loading a very large file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does the loading start, one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and have it start loading each file the moment split finishes writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and can't find anything. Is there a way to do this with split or any other tool?
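One possible direction, offered as a sketch rather than a confirmed answer: GNU coreutils split has a --filter option that pipes each chunk to a shell command, with the chunk's name available in $FILE. Assuming that version of split is installed, the filter can write the chunk and then echo its name, so the name reaches standard output only once the chunk is complete. The small demo below uses a 10-line sample file and a while loop as a stand-in for parallel ./carga_postgres.sh {}; the temporary paths and sizes are made up for illustration.

```shell
#!/bin/sh
# Small-scale demo; the sample file, chunk size and paths are illustrative only.
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/carga"
seq 1 10 > "$workdir/sample.psv"

# split pipes each chunk to the filter command and sets $FILE to the chunk's name.
# The name is echoed to split's stdout only after cat has finished writing the chunk.
loaded=$(split -l 4 --filter='cat > "$FILE" && echo "$FILE"' \
    "$workdir/sample.psv" "$workdir/carga/2011_" |
  while read -r chunk; do
    # Stand-in for the real consumer: parallel ./carga_postgres.sh {}
    echo "$(basename "$chunk") $(wc -l < "$chunk")"
  done)

echo "$loaded"
```

For the real job, the same idea would presumably look like split -l 50000000 --filter='cat > "$FILE" && echo "$FILE"' 2011.psv carga/2011_ | parallel ./carga_postgres.sh {} (not tested at that scale).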