
I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.

  1. Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
  2. The resulting .txt files must not exceed a predefined limit, say, one million lines.

That is, if 1.txt contains 3.5M lines matching those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the last containing 500K lines), that's all.

I tried to make use of something like

gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000 

But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.

lyrically wicked

3 Answers


You can perhaps use zgrep.

zgrep [ grep_options ] [ -e ] pattern filename.gz ...

NOTE: zgrep is a wrapper script (installed with the gzip package) which internally runs essentially the same command mentioned in the other answers.
However, it is more readable in a script and easier to type manually.
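Putting the pieces together for the question's exact requirements (match one regex, exclude another, split into 1M-line chunks), a zgrep-based pipeline might look like the sketch below. The regexes are placeholders, a tiny sample file stands in for the 20GB one, `-l 2` stands in for `-l 1000000`, and `--additional-suffix` requires GNU split:

```shell
# Build a small sample .gz to demonstrate (stands in for the 20GB file)
printf 'keep 1\nkeep 2\nkeep drop\nkeep 3\nother\n' | gzip > file.gz

# Keep lines matching 'keep', exclude lines matching 'drop', then split.
# zgrep passes -P through to grep; -d gives numeric suffixes (part00.txt, ...)
zgrep -P 'keep' file.gz | grep -vP 'drop' \
  | split -l 2 -d - part --additional-suffix=.txt

ls part*.txt   # part00.txt part01.txt
```

Nothing here is buffered to disk except the output parts themselves, so the 20GB intermediate never materializes.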

anishsane
  • Oops! I forgot the good old `zgrep`! Thanks for reminding! Actually with `zgrep` you can spare some resources (at least one `fork` and `exec`). – TrueY Jan 08 '15 at 09:41

I'm afraid it's impossible; quoting from the gzip man page:

If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip.

UPDATE: After the edit: if the .gz contains only one file, a one-step tool like awk should be fine:

gzip -cd path/to/test/file.gz | awk '/my regex/{if(count==0)out="part" ++part ".txt"; print > out; if(++count==1000000){close(out); count=0}}'

split is also a good choice, but you will have to rename the files afterwards.
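The awk approach extends naturally to the question's second (exclusion) regex. A runnable sketch, with placeholder patterns, a small sample file standing in for the 20GB one, and `limit=2` standing in for 1000000:

```shell
# Sample data standing in for the decompressed 20GB stream
printf 'keep a\nkeep bad\nkeep b\nkeep c\n' | gzip > file.gz

# Keep lines matching /keep/ but not /bad/; start a new partN.txt every
# `limit` lines, closing each finished file so awk never exceeds the
# per-process open-file limit.
gzip -cd file.gz | awk -v limit=2 '
  /keep/ && !/bad/ {
    if (count == 0) out = "part" ++part ".txt"   # open the next part file
    print > out
    if (++count == limit) { close(out); count = 0 }
  }'

ls part*.txt   # part1.txt part2.txt
```

Remember that awk matches with ERE, so a pattern written for `grep -P` may need rewriting (see the comment below).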

Juan Diego Godoy Robles
  • The problem with this solution can be that awk uses ERE and not perl-style regex. They have fairly different syntax, so most probably `my regex` will not work in awk except for very basic patterns. – TrueY Jan 08 '15 at 08:51
  • @TrueY Well, that's OK, awk regexes will be enough – lyrically wicked Jan 08 '15 at 09:15
  • @TrueY EREs are far from `very basic patterns`. Maybe you're thinking of BREs as supported by default by sed. – Ed Morton Jan 08 '15 at 15:14
  • @EdMorton: Sorry if I was not clear! I meant that ERE (used by `awk` or `grep -E`) uses a very different syntax than perl's RE (also used by `grep -P`). Simple patterns can work in both (like "my pattern"). The OP specified `-P` in the question, so I tried to emphasize that awk can fail if a perl-style RE is used. – TrueY Jan 08 '15 at 15:59

Your solution is almost good. The problem is that you need to tell gzip what to do: to decompress, use -d. So try:

gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000 

But with this you will get a bunch of files named xaa, xab, xac, ... I suggest using split's PREFIX argument and numeric suffixes (`-d`) to create better output names:

gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file

In this case the result files will be named file00, file01, file02, etc.

If you also want to filter out lines matching a second regex, you can try something like this:

gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file

I hope this helps.
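If GNU coreutils is available, split can also produce names closer to the part1.txt, part2.txt scheme the question asks for. A sketch with placeholder regexes, a tiny sample file in place of the 20GB one, and `-l 2` in place of `-l 1000000`:

```shell
# Sample data standing in for the real compressed file
printf 'x 1\nx skip\nx 2\nx 3\n' | gzip > file.gz

# --numeric-suffixes=1 starts numbering at 01 instead of 00;
# --additional-suffix appends the .txt extension (both GNU extensions)
gzip -dc file.gz | grep -P 'x' | grep -vP 'skip' \
  | split -l 2 --numeric-suffixes=1 --additional-suffix=.txt - part

ls part*.txt   # part01.txt part02.txt
```

The second `grep -v` stage adds one more process to the pipeline, but each stage streams, so memory use stays constant regardless of input size.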

TrueY
  • Does grep allow filtering lines matching the first regex **AND** not matching the second? – lyrically wicked Jan 08 '15 at 08:15
  • You can use `grep` and then `grep -v`. – TrueY Jan 08 '15 at 08:44
  • @lyricallywicked Or you can even use the look-ahead and look-behind buffers supported by perl regex. – TrueY Jan 08 '15 at 08:55
  • "You can use grep and then grep -v" - requires a significant amount of additional time, because you can't do something like `--regex="first"&&!"second"` in one single command, right? – lyrically wicked Jan 08 '15 at 09:00
  • "You can use look-ahead and look-behind buffers" - Some time ago, I tried to use regex-only way to filter something **NOT** containing something. All I remember is that was **too** slow. See [Regular expression to match string not containing a word?](http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word) – lyrically wicked Jan 08 '15 at 09:11
  • @lyricallywicked: Sure, using look-* buffers can be slow. That's why I suggested `grep -v`. – TrueY Jan 08 '15 at 09:26