I have a directory with ~50k files. Each file has ~700,000 lines. I have written an awk program to read each line and print it only if there is an error. Everything runs perfectly fine, but the time taken is huge: ~4 days! Is there a way to reduce this time? Can we use multiple cores (processes)? Has anyone tried this before?
- You are printing only if there's an error? Does this consist simply of looking for a certain pattern in the file? Did you check CPU usage to verify that you are actually CPU bound? – nneonneo Apr 09 '13 at 04:53
- Yes. Printing only if there is an error - just to reduce IO. I have not checked CPU usage. Even if it is not CPU intensive, how can we parallelize this operation? – Sandhya Prema Apr 09 '13 at 04:55
- It sounds like you're I/O bound, in which case multiple processes will probably not help. Also, consider using `grep` if you are just looking for errors. – nneonneo Apr 09 '13 at 04:57
- possible duplicate of [Parallelize Bash Script](http://stackoverflow.com/questions/38160/parallelize-bash-script) – nneonneo Apr 09 '13 at 04:59
- I can not use grep 'coz some of the comparisons are dependent on parameter values spread across multiple lines in the same file. But let me look at the other pointer you have provided. Thanks for the help. I will come back after trying to call this awk script from bash multiple times :) – Sandhya Prema Apr 09 '13 at 05:03
1 Answer
`awk` and `gawk` will not fix this for you by themselves. There is no magic "make it parallel" switch. You will need to rewrite to some degree:
- shard by file - the simplest way to fix this is to run multiple `awk` processes in parallel, one per file. You will need some sort of dispatch mechanism. Parallelize Bash script with maximum number of processes shows how you can write this yourself in shell (a minimal sketch also follows this list). It will take more reading, but if you want more features, check out Gearman or Celery, which should be adaptable to your problem.
- better hardware - it sounds like you probably need a faster CPU to make this go faster, but it could also be an I/O issue. Graphs of CPU and I/O from munin or some other monitoring system would help isolate which is the bottleneck in this case. Have you tried running this job on an SSD-based system? That is often an easy win these days.
- caching - there are probably some duplicate lines or files. If there are enough duplicates, it would be helpful to cache the processing in some way. If you calculate the CRC/`md5sum` for each file and store it in a database, you can compute the `md5sum` of a new file and skip processing if you've already handled it (see the second sketch below).
- complete rewrite - scaling this with `awk` is going to get ridiculous at some point. Using some map-reduce framework might be a good idea.
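To make the shard-by-file idea concrete, here is a minimal sketch that runs one `awk` process per file, with up to 8 in flight at a time. The script name `check_errors.awk`, the directory `/path/to/logs`, and the output file `errors.out` are placeholders for whatever you actually use; adjust `-P` to your core count.

```bash
# Process every file in parallel, one awk process per file.
# -P 8 caps the number of concurrent processes; -n 1 passes one file to each awk.
# check_errors.awk, /path/to/logs, and errors.out are hypothetical names.
find /path/to/logs -type f -print0 \
  | xargs -0 -n 1 -P 8 awk -f check_errors.awk >> errors.out
```

Note that several workers appending to the same `errors.out` can interleave their output; if that matters, have each worker write its own output file and concatenate them afterwards.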
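In the same spirit, the caching suggestion can be sketched as a small wrapper that remembers the `md5sum` of every file it has already processed and skips exact duplicates. `processed.md5` and `check_errors.awk` are again hypothetical names, and this simple version is not safe to run concurrently (the cache file would need locking for that).

```bash
#!/bin/bash
# Usage: ./check_cached.sh <file>
# Skips the awk pass entirely if an identical file was processed before.
file="$1"
sum=$(md5sum "$file" | cut -d' ' -f1)

if ! grep -q "$sum" processed.md5 2>/dev/null; then
    awk -f check_errors.awk "$file"
    echo "$sum" >> processed.md5   # remember this checksum for next time
fi
```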

chicks