
I am trying to execute a command like this:

find ./ -name "*.gz" -print -exec ./extract.sh {} \;

The gz files themselves are small. Currently my extract.sh contains the following:

# Start delimiter
echo "#####" $1 >> Info
zcat $1 > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info

Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?

I have 80K gz files on a machine with the massive horsepower of 32 cores.

Legend
    Is there any reason that you couldn't use a separate temp file for each file you're extracting? –  Mar 22 '12 at 19:13
    uuoc: `cat temp | grep` is redundant, grep accepts input files as an argument. – Chris Browne Mar 22 '12 at 19:15
  • I wanted a single aggregate file in the end and am looking for the most efficient way of doing this. In my case, this would mean creating an additional 80K files. Do you think it won't matter? – Legend Mar 22 '12 at 19:16
    @ChrisBrowne: Yeap. Just changed it. – Legend Mar 22 '12 at 19:16
    Now that I think about it, couldn't you just `zcat $1 | grep blah`? –  Mar 22 '12 at 19:17
  • Maybe you can write a daemon script that tries to merge any two files in the output directory. If order is necessary, number the generated `Info` files. – Reci Mar 22 '12 at 19:18
  • @K.G.: Actually I have multiple greps and am doing some post-processing to the output so I did not want to do multiple `zcat`s which is why I created a temp file first. I would be open to any improvements though. – Legend Mar 22 '12 at 19:18
  • Well, if you're looking for multiple patterns, then you can always `grep -e thing1 -e thing2` to find both. Have a look at [this question](http://stackoverflow.com/questions/307015/how-do-i-include-a-pipe-in-my-linux-find-exec-command). Based on what you have here, I believe you could do away with `extract.sh` entirely by using unnamed pipes. –  Mar 22 '12 at 19:24
    creation of the temp file is redundant. You can use tee to pipe output to multiple processes (`cat file | tee >(./process1.sh) >(./process2.sh)`). But using a temp file is more readable and there are functions to help you create unique temp files. – Dunes Mar 22 '12 at 19:24
  • @K.G.: Thank You. I am now making another attempt at this. In the question that you linked, the third comment by RolfWRasmussen is the problem I am facing. When I pipe the output to one file even when using xargs, the output gets interleaved. – Legend Mar 22 '12 at 19:29
  • @K.G.: Also, is there a way I can maintain order when `grep`ping for multiple patterns? This is the only reason why I am doing multiple separate greps. If there is a way to say: `grep -e "(1)..." -e "(2)..."` or something close, that would be great. – Legend Mar 22 '12 at 19:46
  • You can pipe the output of your grep to `sort`. You're still (likely) going to run into the issue of interleaving your results if you're running this in parallel. –  Mar 22 '12 at 19:58

5 Answers


Assume (just for simplicity and clarity) that all your file names start with a-z.

So you could use 26 cores in parallel by launching a find command like the one above for each letter. Each find needs to write to its own aggregate file:

find ./ -name "a*.gz" -print -exec ./extract.sh a {} \; &
find ./ -name "b*.gz" -print -exec ./extract.sh b {} \; &
..
find ./ -name "z*.gz" -print -exec ./extract.sh z {} \;

(extract.sh needs to use its first parameter to select a separate "Info" destination file)

When you want one big aggregate file, just join all the per-letter aggregates.
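
A minimal sketch of what that per-letter extract.sh could look like (the letter argument and the Info.$letter / temp.$letter names are just assumptions for illustration; the grep patterns stay whatever they are in your real script):

#!/bin/sh
# $1 = letter bucket, $2 = .gz file to process
letter=$1
file=$2
out="Info.$letter"
tmp="temp.$letter"   # one temp file per bucket so the parallel jobs don't collide

echo "#####" "$file" >> "$out"
zcat "$file" > "$tmp"
grep -o -P "..." "$tmp" >> "$out"
grep -o -P "..." "$tmp" >> "$out"
rm "$tmp"
echo "####" >> "$out"

The per-letter aggregates can then be joined with something like cat Info.[a-z] > Info.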

However, I am not convinced you will gain much performance with this approach; in the end, all the file content still has to be read from the same disk.

Hard disk head movement will probably be the limitation, not the unzip (CPU) performance.

But it's worth a try.

stefan bachert
  • +1 for the idea. However, I somehow get the feeling that there is a one-liner solution that I am missing. – Legend Mar 22 '12 at 19:48

A quick check through the findutils source reveals that find starts a child process for each exec. I believe it then moves on, though I may be misreading the source. Because of this you are already parallel, since the OS will handle sharing these out across your cores. And through the magic of virtual memory, the same executables will mostly share the same memory space.

The problem you are going to run into is file locking/data mixing. As each individual child runs, it writes into your Info file, and because these are separate script invocations, their output will get mixed together like spaghetti.

To solve this problem, all you need to do is take advantage of the shell's ability to create a temporary file (using tempfile or mktemp), have each script dump into its own temp file, then have each script cat the temp file into the Info file. This does not guarantee that the files will appear in any particular order, just that each individual file's contents stay together. Don't forget to delete your temp file after use.
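
A minimal sketch of that idea (the mktemp call and the final cat-and-append are my assumptions about how the pieces fit together, not code from the question):

#!/bin/sh
# write everything for this one .gz file into a private temp file first
tmp=$(mktemp) || exit 1

echo "#####" "$1" >> "$tmp"
zcat "$1" > "$tmp.unzip"
grep -o -P "..." "$tmp.unzip" >> "$tmp"
grep -o -P "..." "$tmp.unzip" >> "$tmp"
echo "####" >> "$tmp"

# one single append into the shared Info file, then clean up
cat "$tmp" >> Info
rm -f "$tmp" "$tmp.unzip"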

If the temp files are in RAM (see tmpfs), then you will avoid being IO bound except when writing to your final file and when running the find search.

Tmpfs is a special file system that uses your RAM as "disk space". It will take up to the amount of RAM you allow, not use more than it needs from that amount, and swap to disk as needed if it does fill up.

To use:

  1. Create a mount point (I like /mnt/ramdisk or /media/ramdisk)
  2. Edit /etc/fstab as root
  3. Add tmpfs /mnt/ramdisk tmpfs size=1G 0 0
  4. Run mount /mnt/ramdisk as root to mount your new ramdisk. It will also be mounted at boot.

See the wikipedia entry on fstab for all the options available.
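
If you would rather try it without editing /etc/fstab first, a one-off mount works too, and then the temp files just need to be created on the ramdisk; a sketch, assuming the /mnt/ramdisk mount point and 1G size from above:

# as root: mount a 1G tmpfs on the ramdisk mount point
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk

# in extract.sh: create the temp file on the ramdisk instead of the default /tmp
tmp=$(mktemp -p /mnt/ramdisk)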

Spencer Rathbun
  • +1 Thank You. Could you please elaborate on your third paragraph? I am currently looking into the tmpfs. – Legend Mar 22 '12 at 19:47
  • @Legend sure. For a tiny example project using tmpfs, see my port of the [ffox-daemon](https://github.com/srathbun/firefox-tmpfs-daemon) to debian. I'll also edit my answer with details. – Spencer Rathbun Mar 22 '12 at 19:53

You can use xargs to run your search in parallel. --max-procs limits the number of processes run at the same time (the default is 1):

find ./ -name "*.gz" -print | xargs --max-args 1 --max-procs 32 ./extract.sh

In ./extract.sh you can use mktemp to write the data from each .gz to a temporary file; all of them can be combined later:

# Start delimiter
tmp=`mktemp -t Info.XXXXXX`
src=$1
echo "#####" $1 >> $tmp
zcat $1 > $tmp.unzip
src=$tmp.unzip

# Series of greps to extract some useful information
grep -o -P "..." $src >> $tmp
grep -o -P "..." $src >> $tmp
rm $src
echo "####" >> $tmp

If you have massive horsepower you can use zgrep directly, without unzipping first. But it may be faster to zcat first if you have many greps later.
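
A sketch of that zgrep variant, reusing the $tmp variable from the script above (no .unzip temp file is created, but each zgrep decompresses the .gz again):

zgrep -o -P "..." "$1" >> "$tmp"
zgrep -o -P "..." "$1" >> "$tmp"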

Anyway, later combine everything into a single file:

cat /tmp/Info.* > Info
rm /tmp/Info.*

If you care about the order of the .gz files, pass the sequence number as a second argument to ./extract.sh:

find files/ -name "*.gz" | nl -n rz | sed -e 's/\t/\n/' | xargs --max-args 2 ...

And in ./extract.sh:

tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
Bartosz Moczulski

I would create a temporary directory. Then create an output file for each grep (based on the name of the file it processed). On many systems, files created under /tmp are located on a RAM disk (tmpfs) and so will not thrash your hard drive with lots of writes.

You can then either cat it all together at the end, or get each grep to signal another process when it has finished and that process can begin catting files immediately (and removing them when done).

Example:

working_dir="`pwd`"
temp_dir="`mktemp -d`"
cd "$temp_dir"
find "$working_dir" -name "*.gz" | xargs -P 32 -n 1 extract.sh 
cat *.output > "$working_dir/Info"
rm -rf "$temp_dir"

extract.sh

 filename=$(basename $1)
 output="$filename.output"
 extracted="$filename.extracted"
 zcat "$1" > "$extracted"

 echo "#####" $filename > "$output"
 # Series of greps to extract some useful information
 grep -o -P "..." "$extracted" >> "$output"
 grep -o -P "..." "$extracted" >> "$output"
 rm "$extracted"
 echo "####" >> "$output"
Dunes
  • I wonder why this answer was downvoted. Also, there is a slight problem with this. When `output="$1.output"` it is not creating the tmp files in the tmp directory but rather in the original directory. – Legend Mar 22 '12 at 20:34
  • Modified it to a working version by using `$(basename $1)` to extract the filename. – Legend Mar 22 '12 at 20:39

The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once, then print a summary in the order you want. As an added benefit, the report for each file gets written as a single block, although that might not prevent interleaved output completely. Still, here's my attempt.

#!/bin/sh

for f; do
    zcat "$f" |
    perl -ne '
        /(pattern1)/ && push @pat1, $1;
        /(pattern2)/ && push @pat2, $1;
        # ...
        END { print "##### '"$f"'\n";
            print join ("\n", @pat1), "\n";
            print join ("\n", @pat2), "\n";
            # ...
            print "#### '"$f"'\n"; }'
done

Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.

The script accepts multiple .gz files as input, so you can use find -exec extract.sh {} \+ or xargs to launch a number of parallel processes. With xargs you can try to find a balance between sequential jobs and parallel jobs by feeding each new process, say, 100 to 500 files in one batch. You save on the number of new processes, but lose in parallelization. Some experimentation should reveal what the balance should be, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
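
For example (just a sketch: the batch size of 200 is exactly the kind of number pulled out of a hat described above, and output from parallel batches can still interleave inside Info):

find ./ -name "*.gz" -print0 |
    xargs -0 --max-args 200 --max-procs 32 ./extract.sh >> Info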

Granted, if your input files are small enough, the multiple grep invocations will be served out of the disk cache and may turn out to be faster than the overhead of starting up Perl.

tripleee