<block id="123">
    <othertag1>...</othertag1>
    <othertag2>...</othertag2>
    <picture>...</picture>
    <othertag3>...</othertag3>
    <othertag4>...</othertag4>
</block>

How can I use ag or grep to find such blocks, across many files, that have no <picture> tag?

Advanced: get the "id" from the <block> tag of each such block (for example, print them as a list to stdout).

Andy Lester
vladon
  • **Don't use regular expressions to parse HTML**. Use a dedicated parsing tool in the programming language of your choice. Regexes have no concept of things like "blocks". – Andy Lester Sep 28 '16 at 14:53

2 Answers


Yes, you could use your hammer to drive in that screw.
I'm going to recommend a screwdriver though.
By which I mean that I would use the tool that was made to solve it: XPath!

/block[not(picture)]

For the stretch goal:

/block[not(picture)]/@id

If you're going to parse XML, you should use XPath.
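As a rough illustration of the same idea outside a dedicated XPath tool, here is a minimal Python sketch using the standard-library xml.etree.ElementTree. Its XPath subset has no not() predicate, so the "no picture child" test is done in Python instead. The function name block_ids_without_picture and the <blocks> wrapper element in the sample are assumptions made for this example, not part of the question.

```python
import xml.etree.ElementTree as ET


def block_ids_without_picture(xml_text):
    """Return the id of every <block> that has no direct <picture> child."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works whether <block>
    # is the document element or nested deeper.
    return [
        block.get("id")
        for block in root.iter("block")
        if block.find("picture") is None  # same test as not(picture)
    ]


# Hypothetical sample: the <blocks> wrapper is assumed for illustration.
sample = """<blocks>
  <block id="123"><othertag1/><picture/></block>
  <block id="124"><othertag1/><othertag2/></block>
</blocks>"""

print(block_ids_without_picture(sample))
```

This only works on well-formed XML; for very large files a streaming parser would be the more practical choice.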

Task
  • Hmm, I haven't wanted to do that before. Perhaps like this? http://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell – Task Sep 28 '16 at 16:12
  • Thank you. I had no time to wait, so I wrote the simplest multi-threaded SAX-like parser in C++ :) It took 10 minutes to parse 21,000 files totalling 250 GB. – vladon Sep 29 '16 at 07:23

If you must, you can use a Perl-style regex in which the dot also matches newlines. For example, using ag:

ag '(?s)<block(?!.*?picture).*?</block>'

This is intended to return each <block>...</block> span, even when it covers multiple lines, while excluding blocks that contain the picture tag between the opening and closing block tags.

The (?s) makes . match newlines as well. The (?!...) is a negative lookahead, in this case for the word 'picture'. The .*? is a non-greedy match, so the pattern stops at the first </block>.

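Note: there are corner cases where this search pattern won't work, although my quick test worked well. In particular, the lookahead is not bounded by the closing </block>, so a block is skipped whenever the word 'picture' appears anywhere later in the same file.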

If you wish to further limit the results to just the IDs, pipe another ag to your result:

ag '(?s)<block(?!.*?picture).*?</block>' <directory with files> | ag -o 'id="([0-9]+)"' 
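
A possible variant that keeps the lookahead inside a single block can be sketched in Python (not ag) as follows. Each character is consumed only if it does not start </block> or <picture, so the check cannot run past the end of the block. The names BLOCK_WITHOUT_PICTURE, BLOCK_ID and ids_without_picture are made up for this example, and this is still pattern matching, not real XML parsing.

```python
import re

# Tempered pattern: a character is consumed only if it does not begin
# "</block>" or "<picture", so the check stays inside one block.
BLOCK_WITHOUT_PICTURE = re.compile(
    r'<block\b(?:(?!</block>|<picture\b).)*</block>', re.DOTALL
)
# Pulls the id attribute out of the opening <block ...> tag.
BLOCK_ID = re.compile(r'<block\b[^>]*\bid="([^"]*)"')


def ids_without_picture(text):
    """Return ids of <block> elements that contain no <picture> tag."""
    ids = []
    for match in BLOCK_WITHOUT_PICTURE.finditer(text):
        id_match = BLOCK_ID.match(match.group(0))
        if id_match:
            ids.append(id_match.group(1))
    return ids


# Hypothetical file contents with several blocks, assumed for illustration.
sample_text = (
    '<block id="1">\n<a>x</a>\n</block>\n'
    '<block id="2">\n<picture>p</picture>\n</block>\n'
    '<block id="3">\n<b/>\n</block>\n'
)

print(ids_without_picture(sample_text))
```

Nested <block> elements, comments, and CDATA sections would still confuse it, which is the usual argument for a real parser.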
gregory