
I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql copy.

The problem is that it takes about 7 hours to split the file, and only then does it start loading, one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and it can start loading a file the moment split finishes writing it. Something like this:

split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}

I have read the split man page and can't find anything. Is there a way to do this with split or any other tool?
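
For context, carga_postgres.sh is essentially a wrapper around psql's copy. A minimal sketch of what such a script might look like (the database, table, and delimiter here are assumptions for illustration, not the actual script):

#! /bin/sh

# Hypothetical loader: bulk-load one pipe-separated file into PostgreSQL.
# "mydb" and "mytable" are placeholder names.
psql -d mydb -c "\copy mytable FROM '$1' WITH DELIMITER '|'"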


2 Answers


You could let parallel do the splitting:

<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh

Note that the man page recommends using --block over -N; this will still split the input at record separators (\n by default), e.g.:

<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
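
Note that with --pipe, each job receives its chunk on stdin rather than as a file name argument, so carga_postgres.sh must read standard input. A minimal sketch of a stdin-reading loader (database, table, and delimiter are assumptions):

#! /bin/sh

# Hypothetical loader for --pipe: COPY the chunk arriving on stdin.
# "mydb" and "mytable" are placeholder names.
psql -d mydb -c "COPY mytable FROM STDIN WITH DELIMITER '|'"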

Testing --pipe and -N

Here's a test that splits a sequence of 100 numbers into 5 files:

seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'

Check result:

wc -l /tmp/parallel_test_[1-5]

Output:

 23 /tmp/parallel_test_1
 23 /tmp/parallel_test_2
 23 /tmp/parallel_test_3
 23 /tmp/parallel_test_4
  8 /tmp/parallel_test_5
100 total
  • Will the `--pipe -N 50000000` options of parallel send the 50000000 lines to `carga_postgres.sh`'s stdin? – Topo Feb 28 '13 at 21:00
  • @Topo: Yes that's correct. I've edited the answer to illustrate how `--pipe` and `-N` work. – Thor Feb 28 '13 at 21:54
  • 1
    Since you input is gigantic, you may want to look at --joblog --resume --resume-failed (requires version 20130222). – Ole Tange Feb 28 '13 at 22:55
  • @Thor I really liked your answer, but when I ran it, zsh gave me an error, something like `zsh: 4 expected number`, and I just can't figure out what the problem is. – Topo Mar 01 '13 at 06:13
  • @Topo: I also use `zsh` but am unable to reproduce this error. Does the test example work? What versions of `parallel` and `zsh` are you using? Does it help if you run this under `zsh -f`? – Thor Mar 01 '13 at 07:38
  • @Thor The test example did work, about the other info, I'll have to wait til tomorrow to post. – Topo Mar 01 '13 at 07:40
  • Parallel is exactly what I needed; the machine I'm working on doesn't support --filter in split, and the parallel execution is great too! – Matt Sep 14 '17 at 15:01

If you use GNU split, you can do this with the --filter option:

‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.

You can create a shell script which writes the file and then starts carga_postgres.sh in the background at the end:

#! /bin/sh

# Write the chunk that split feeds us on stdin to the output file,
# then start the loader in the background so split can move on to
# the next chunk immediately.
cat >"$FILE"
./carga_postgres.sh "$FILE" &

and use that script as the filter:

split -l 50000000 --filter=./filter.sh 2011.psv
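
Alternatively, --filter can produce exactly the pipeline asked for in the question: have the filter write each file and then echo its name, so parallel starts loading a file the moment split finishes writing it (an untested sketch):

split -l 50000000 --filter='cat >$FILE; echo $FILE' 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
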
  • What `cat` is doing in your example is writing each line that `split` sends through stdin to the file `$FILE`, and then it passes the `$FILE` file name to carga_postgres.sh? – Topo Feb 28 '13 at 21:05
  • @Topo Correct, split sends the lines to the filter's stdin. `$FILE` is the name split has chosen. You are free to use another unique name, of course. – Olaf Dietsche Feb 28 '13 at 21:08
  • I've been doing some small tests (just executing a `head -1` on every file), and when I add the `&` at the end of the filter flag, the files generated by `split` are empty and the `head` command doesn't display anything... but if I execute the `split` without the `&`, the split-generated files are normal and `head` prints to the screen... Is this the correct behaviour or am I misinterpreting something? – Topo Feb 28 '13 at 22:20
  • 1
    @Topo I just tested this and stdin is indeed redirected from `/dev/null`, if you run the filter with `&`. But if you run the filter without, split will wait until the filter exits, before starting the next filter. So it seems, you must use the script version in order to run carga_postgres.sh in parallel. I updated my answer accordingly. – Olaf Dietsche Feb 28 '13 at 22:36