
I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.

  1. Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
  2. The resulting .txt files must not exceed a predefined limit, say, one million lines.

That is, if 1.txt contains 3.5M lines matching those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the last containing 500K lines), that's all.

I tried to make use of something like

gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000 

But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.

lyrically wicked

3 Answers


You can perhaps use zgrep.

zgrep [ grep_options ] [ -e ] pattern filename.gz ...

NOTE: zgrep is a wrapper script (installed with the gzip package) which internally runs essentially the same command mentioned in the other answers.
However, it is more readable in a script and easier to type manually.
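Putting the pieces together for the question's exact requirements (match one regex, exclude another, split into 1M-line chunks), a zgrep-based pipeline might look like the sketch below. The regexes are placeholders, a tiny sample file stands in for the 20GB one, `-l 2` stands in for `-l 1000000`, and `--additional-suffix` requires GNU split:

```shell
# Build a small sample .gz to demonstrate (stands in for the 20GB file)
printf 'keep 1\nkeep 2\nkeep drop\nkeep 3\nother\n' | gzip > file.gz

# Keep lines matching 'keep', exclude lines matching 'drop', then split.
# zgrep passes -P through to grep; -d gives numeric suffixes (part00.txt, ...)
zgrep -P 'keep' file.gz | grep -vP 'drop' \
  | split -l 2 -d - part --additional-suffix=.txt

ls part*.txt   # part00.txt part01.txt
```

Nothing here is buffered to disk except the output parts themselves, so the 20GB intermediate never materializes.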

anishsane
  • Oops! I forgot the good old `zgrep`! Thanks for reminding! Actually with `zgrep` you can spare some resources (at least one `fork` and `exec`). – TrueY Jan 08 '15 at 09:41

I'm afraid it's impossible; quoting from the gzip man page:

If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip.

UPDATE: After the edit: if the .gz contains only one file, a one-step tool like awk should be fine:

gzip -cd path/to/test/file.gz | awk '/my regex/{if(count==0)out="part" ++part ".txt"; print > out; if(++count==1000000){close(out); count=0}}'

split is also a good choice, but you will have to rename the files afterwards.
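The awk approach extends naturally to the question's second (exclusion) regex. A runnable sketch, with placeholder patterns, a small sample file standing in for the 20GB one, and `limit=2` standing in for 1000000:

```shell
# Sample data standing in for the decompressed 20GB stream
printf 'keep a\nkeep bad\nkeep b\nkeep c\n' | gzip > file.gz

# Keep lines matching /keep/ but not /bad/; start a new partN.txt every
# `limit` lines, closing each finished file so awk never exceeds the
# per-process open-file limit.
gzip -cd file.gz | awk -v limit=2 '
  /keep/ && !/bad/ {
    if (count == 0) out = "part" ++part ".txt"   # open the next part file
    print > out
    if (++count == limit) { close(out); count = 0 }
  }'

ls part*.txt   # part1.txt part2.txt
```

Remember that awk matches with ERE, so a pattern written for `grep -P` may need rewriting (see the comment below).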

Juan Diego Godoy Robles
  • The problem with this solution can be that awk uses ERE and not perl-style regex. They have fairly different syntax, so most probably `my regex` will not work in awk except for very basic patterns. – TrueY Jan 08 '15 at 08:51
  • @TrueY Well, that's OK, awk regexes will be enough – lyrically wicked Jan 08 '15 at 09:15
  • @TrueY EREs are far from `very basic patterns`. Maybe you're thinking of BREs as supported by default by sed. – Ed Morton Jan 08 '15 at 15:14
  • @EdMorton: Sorry if I was not clear! I meant that ERE (used by `awk` or `grep -E`) uses a very different syntax than perl's RE (also used by `grep -P`). Simple patterns can work in both (like "my pattern"). The OP specified `-P` in the question, so I tried to emphasize that awk can fail if a perl-style RE is used. – TrueY Jan 08 '15 at 15:59

Your solution is almost good. The problem is that you need to tell gzip what to do: to decompress, use -d. So try:

gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000 

But with this you will get a bunch of files named xaa, xab, xac, ... I suggest using split's PREFIX argument and numeric suffixes (`-d`) to create better output names:

gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file

In this case the result files will be named file00, file01, file02, etc.

If you also want to filter out lines matching a second regex, you can try something like this:

gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file

I hope this helps.
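If GNU coreutils is available, split can also produce names closer to the part1.txt, part2.txt scheme the question asks for. A sketch with placeholder regexes, a tiny sample file in place of the 20GB one, and `-l 2` in place of `-l 1000000`:

```shell
# Sample data standing in for the real compressed file
printf 'x 1\nx skip\nx 2\nx 3\n' | gzip > file.gz

# --numeric-suffixes=1 starts numbering at 01 instead of 00;
# --additional-suffix appends the .txt extension (both GNU extensions)
gzip -dc file.gz | grep -P 'x' | grep -vP 'skip' \
  | split -l 2 --numeric-suffixes=1 --additional-suffix=.txt - part

ls part*.txt   # part01.txt part02.txt
```

The second `grep -v` stage adds one more process to the pipeline, but each stage streams, so memory use stays constant regardless of input size.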

TrueY
  • Does grep allow filtering lines matching the first regex **AND** not matching the second? – lyrically wicked Jan 08 '15 at 08:15
  • You can use `grep` and then `grep -v`. – TrueY Jan 08 '15 at 08:44
  • @lyricallywicked Or you can even use the look-ahead and look-behind buffers supported by perl regex. – TrueY Jan 08 '15 at 08:55
  • "You can use grep and then grep -v" - requires a significant amount of additional time, because you can't do something like `--regex="first"&&!"second"` in one single command, right? – lyrically wicked Jan 08 '15 at 09:00
  • "You can use look-ahead and look-behind buffers" - Some time ago, I tried to use regex-only way to filter something **NOT** containing something. All I remember is that was **too** slow. See [Regular expression to match string not containing a word?](http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word) – lyrically wicked Jan 08 '15 at 09:11
  • @lyricallywicked: Sure, using look-* buffers can be slow. That's why I suggested `grep -v`. – TrueY Jan 08 '15 at 09:26