
So, naively, I wanted to parse 50 files using awk, and I did the following:

zcat dir_with_50files/* > huge_file
cat huge_file | awk '{parsing}'

Of course, this was terrible: it spent time creating a file on disk, then consumed a whole bunch of memory passing it along to awk.

Then a coworker showed me that I could do this instead:

zcat dir_with_50files/filename{0..50} | awk '{parsing}'

I was amazed that I got the same results without the memory consumption. `ps aux` also showed that the two commands ran in parallel. I was confused about what was happening, and this SO answer partially answered my question:

https://stackoverflow.com/a/1072251/6719378

But if a pipe starts the second command once a certain amount of data is buffered, why does my naive approach consume so much more memory than the second approach? Is it because I am using cat on a single huge file rather than streaming multiple files?

    Not sure I understand the question. You created a file. A file uses memory. Is that the answer? btw see https://www.google.com/search?q=uuoc&ie=utf-8&oe=utf-8 then consider `awk 'script' file` rather than `cat file | awk 'script'`. – Ed Morton Dec 02 '16 at 14:40
    How are you measuring memory usage? Are you sure that your memory usage isn't just [disk cache](http://www.linuxatemyram.com/)? (Try `free -m`) – e0k Dec 03 '16 at 00:29
  • @EdMorton He is using [`zcat`](https://linux.die.net/man/1/zcat), which decompresses a gzipped file before `cat`-ing it. I don't think `awk` can do decompression of input files like that. – e0k Dec 03 '16 at 00:43
    @e0k not in the code I'm referring to he isn't: `cat huge_file | awk '{parsing}'` – Ed Morton Dec 03 '16 at 04:50
  • It is not because you are using cat on a single file compared to loading multiple files. `zcat` in both the first and second case are only reading one .gz file at a time. – webb Dec 05 '16 at 22:50
  • zcat has to do the same amount of work in either case, gunzip all the files. In the first case awk does not begin till after all the files are decoded, in the second case awk is free to start on the first file before it has even finished being decoded. – tomc Dec 11 '16 at 09:11
  • Are you mixing up memory and disk space here? – tommy.carstensen Dec 14 '16 at 00:05

1 Answer


You can reduce maximum memory usage by running zcat file by file.

ex:

for f in dir_with_50files/*
 do
    zcat "$f" | awk '{parsing}' >> Result.File
 done

# or (note: find's -exec cannot contain a pipe directly, so wrap it in sh -c)

find dir_with_50files/ -type f -exec sh -c 'zcat "$1" | awk "{parsing}"' sh {} \; >> Result.File

but it depends on your parsing:

  • OK for modifying, deleting, or copying lines when there is no relation to previous items (ex: sub(/foo/, "bar"))
  • bad for counters (ex: List[$2]++) or cross-file relations (ex: NR != FNR {...}; ! List[$2]++ {...})
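To illustrate the last caveat, here is a minimal sketch (with hypothetical demo files, not from the original post) showing why counters differ between one awk over the whole stream and one awk per file: each per-file run keeps its own array, so its END block only sees that file's lines.

```shell
# Create two small gzipped demo files (hypothetical data for illustration).
mkdir -p demo
printf 'a\nb\n' | gzip > demo/f1.gz
printf 'a\nc\n' | gzip > demo/f2.gz

# Single awk over the whole stream: counts span all files.
zcat demo/*.gz | awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' | sort
# prints: a 2 / b 1 / c 1

# One awk per file: each run counts independently, so "a" is never 2.
for f in demo/*.gz; do
  zcat "$f" | awk '{ n[$1]++ } END { for (k in n) print k, n[k] }'
done | sort
# prints: a 1 / a 1 / b 1 / c 1
```

So the file-by-file approach is only safe when each line can be processed without reference to the rest of the input.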
NeronLeVelu