I'm working on optimizing a Python script that needs to parse a huge amount of data (12 TB). At the moment, it basically looks like:
gzip -d -c big_file.gz | sed -n '/regex|of|interesting|things/p' | script.py
(Actually, the piping is being done by subprocess.Popen, but I don't think that's important -- correct me if I'm wrong.)
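In case it matters, the Popen wiring looks roughly like this (the filename and regex are placeholders, and I've elided the actual parsing):

```python
import subprocess

# Decompress the file in a child process (filename is a placeholder).
gzip_proc = subprocess.Popen(
    ["gzip", "-d", "-c", "big_file.gz"],
    stdout=subprocess.PIPE,
)

# Filter the decompressed stream with sed (regex is a placeholder).
sed_proc = subprocess.Popen(
    ["sed", "-n", "/regex|of|interesting|things/p"],
    stdin=gzip_proc.stdout,
    stdout=subprocess.PIPE,
)

# Close our copy of the read end so gzip sees EPIPE if sed exits early.
gzip_proc.stdout.close()

# script.py (this script) then consumes the filtered lines.
for line in sed_proc.stdout:
    pass  # ... actual parsing happens here ...

sed_proc.wait()
gzip_proc.wait()
```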
It appears that the gzip->sed->python pipeline is currently the most time-consuming part of the script. I assume this is because there are three separate processes in play: since none of them can share an address space, any data passed from one to the next has to be physically copied, so the pipeline ends up pushing up to 36 TB through my RAM (the 12 TB of data once per process) rather than just 12.
Am I understanding correctly what's going on?