2

Hi everyone,

I'm having trouble using a regular expression to grep text out of HTML whose paragraphs each end with

</p>

I'm using unsung hero.*</p> to grep the paragraph I'm interested in, but I can't make the match stop at the next </p>.

The command I use is:

egrep "unsung hero.*</p>" test

and in test is a webpage like:

<p>There are going to be outliers among us, people with extraordinary skill at recognizing faces. Some of them may end up as security officers or gregarious socialites or politicians. The rest of us are going to keep smiling awkwardly at office parties at people we\'re supposed to know. It\'s what happens when you stumble around in the 21st century with a mind that was designed in the Stone Age.</p>\n    <p>(SOUNDBITE OF MUSIC)</p>\n    <p>VEDANTAM: This week\'s show was produced by Chris Benderev and edited by Jenny Schmidt. Our supervising producer is Tara Boyle. Our team includes Renee Cohen, Parth Shah, Laura Kwerel, Thomas Lu and Angus Chen.</p>\n    <p>Our unsung hero this week is Alexander Diaz, who troubleshoots technical problems whenever they arise and has the most unflappable, kind disposition in the face of whatever crisis we throw his way. Producers at NPR have taken to calling him Batman because he\'s constantly, silently, secretly saving the day. Thanks, Batman.</p>\n    <p>If you like today\'s episode, please take a second to share it with a friend. We\'re always looking for new people to discover our show. I\'m Shankar Vedantam, and this is NPR.</p>\n    <p>(SOUNDBITE OF MUSIC)</p>\n\n    <p class="disclaimer">Copyright &copy; 2019 NPR.  All rights reserved.  Visit our website <a href="https://www.npr.org/about-npr/179876898/terms-of-use">terms of use</a> and <a href="https://www.npr.org/about-npr/179881519/rights-and-permissions-information">permissions</a> pages at <a href="https://www.npr.org">www.npr.org</a> for further information.</p>\n\n    <p class="disclaimer">NPR transcripts are created on a rush deadline by <a href="http://www.verb8tm.com/">Verb8tm, Inc.</a>, an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. 
The authoritative record of NPR&rsquo;s programming is the audio record.</p>\n</div><div class="share-tools share-tools--secondary" aria-label="Share tools">\n      <ul>\n

I'm expecting the match to end before

</p>\n    <p>If you like

But it actually went way further than that.

I suspect the regular expression I used is the problem, but I don't know what is wrong with it. Any help will be appreciated.

Thanks!

20190523: Thanks for your suggestions, everyone.

I tried

egrep "unsung hero.*?</p>" test

But it didn't give me the result I wanted. Instead, the output looked like this (screenshot: test result of .*?).

Leo, I feel like this is a useful expression and I'd like to get it right. Could you explain a bit?

The other test I did, using

[^<]*

actually gave the expected result.

justnight
  • 1
    A regexp is a "regular expression", not a regression. – Charles Duffy May 22 '19 at 22:08
  • 3
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Joseph Sible-Reinstate Monica May 22 '19 at 22:30
  • ...and I do agree with the flagged duplicate at least in spirit -- `grep` is the wrong tool for this job, until you pair it with something else that can decode the HTML for you. `xmlstarlet sel -t -m '//p' -v . -n | grep -Eo '(unsung hero.*)'`, for example, is a lot more reasonable (might need to be paired with some other tooling for HTML-to-XML conversion, but that's readily present on modern UNIX systems too). – Charles Duffy May 22 '19 at 22:34
  • What do you mean when you say it "went way further"? `egrep` prints the entire matching line, so you can't really tell how much of the line it matched. (Unless you use `-o` / `--only-matching`, which is specific to GNU [e]grep.) – Keith Thompson May 22 '19 at 22:50
  • 1
    For more information on parsing HTML with regular expressions: https://stackoverflow.com/a/1732454/827263 – Keith Thompson May 22 '19 at 22:52
  • For more information on parsing HTML with regular expressions: https://regex101.com/r/EBp658/1 –  May 22 '19 at 23:15

1 Answer

3

With .* the match is greedy: it matches the longest substring possible. (In your case that runs up to the last </p> on the matching line, which is far past the paragraph you want.)

What you actually want is a non-greedy match with .*?

Your specific command should most likely look like this:

grep -P -o "unsung hero.*?</p>" test
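A minimal sketch of the difference, using a shortened two-paragraph sample line rather than the real page. The variable names are made up for illustration, and `-P` assumes a GNU grep built with PCRE support:

```shell
# Hypothetical sample: two paragraphs on ONE line, like the page in the question.
sample='<p>Our unsung hero this week is Batman.</p> <p>If you like it, share it.</p>'

# Greedy: .* runs to the LAST </p> on the line.
greedy=$(printf '%s\n' "$sample" | grep -o -E 'unsung hero.*</p>')

# Non-greedy (PCRE only): .*? stops at the FIRST </p>.
lazy=$(printf '%s\n' "$sample" | grep -o -P 'unsung hero.*?</p>')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:   %s\n' "$lazy"
```

Without `-o`, grep prints the whole matching line in both cases, so the two patterns look identical in the output.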

Another solution would be to extend your regex to the end of the string/webpage and then pick out the wanted substring with a capture group.

UPDATE

As Charles Duffy correctly pointed out, this will not work with the standard (POSIX ERE) syntax. Therefore the command above uses the -P flag to specify that the pattern is a Perl-compatible regular expression.

If your system or application does not support Perl-compatible regular expressions and you are OK with matching until the first < (instead of matching until the first </p>), matching every character except < is the way to go.

With this, the complete command should look like this:

grep -o "unsung hero[^<]*</p>" test
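As a rough illustration of both the portable pattern and its limitation, here is a sketch on two made-up sample lines (the variable names and the sample text are assumptions, not taken from the real page). The second sample contains an inline tag inside the paragraph, which is the case where `[^<]*` cannot reach the closing `</p>`:

```shell
# Hypothetical samples: one plain paragraph, one with a nested <a> tag.
plain='<p>Our unsung hero this week is Batman.</p> <p>If you like it, share it.</p>'
nested='<p>Our unsung hero is <a href="#">Batman</a>.</p> <p>Next.</p>'

# [^<]* cannot cross a "<", so the match ends at the first </p>.
first=$(printf '%s\n' "$plain" | grep -o 'unsung hero[^<]*</p>')

# With an inner tag, [^<]* stops at "<a" and the pattern finds nothing.
# grep exits non-zero on no match; "|| true" keeps "set -e" scripts running.
none=$(printf '%s\n' "$nested" | grep -o 'unsung hero[^<]*</p>' || true)

printf 'plain:  %s\n' "$first"
printf 'nested: [%s]\n' "$none"
```

This works for the paragraph in the question because it contains no other tags. For paragraphs with links or other inline markup, the `-P` variant or a real HTML parser is the safer choice.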

Thanks to Charles for pointing that out in the comments.

Leo
  • 1
    `egrep` [is only guaranteed to support](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html) [POSIX ERE](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html). Non-greedy matches are not part of this standard; they're a PCRE extension. Thus, it's safer (aka more likely to work on systems that don't implement extensions) to change the `.*` to `[^<]*`. – Charles Duffy May 22 '19 at 22:30
  • Using `[^<]*` would terminate on any `<` character, which is not what the question asked for. – Leo May 22 '19 at 22:54
  • It's weird that it didn't work on my local machine, but it worked when I tried it on www.regexpal.com... Why? – justnight May 24 '19 at 06:50
  • @justnight Did you use `egrep`? As Charles pointed out, it won't work with grep (or possibly other programs) because this solution requires an implementation that supports non-greedy matches. – Leo May 28 '19 at 09:32
  • @Leo Yes, I used egrep. I added the test to the description. Maybe you can take a look? Thanks. – justnight May 28 '19 at 22:25
  • @justnight Sorry, my bad. It does not work with the ERE regular expression syntax (because it does not support non-greedy matches) and will only work if the -P flag is specified. (`egrep` implies the `-E` flag and not `-P`; basically, `egrep` is a deprecated shortcut for `grep -E`.) Non-greedy matches are not part of the ERE spec, as Charles pointed out in the first comment. Please check my updated answer with the correction. – Leo May 29 '19 at 11:57