0

So I've been working on a web crawler to parse out readable content from a news site I like, and I've been using regex pretty heavily in Python 2. I visited https://regexr.com/ to double check that I had the correct expression for this use case, but I keep getting different results than expected, specifically when I cross reference the output from regexr. Here is the expression

re.compile(ur"[\s\S\]*<p.*>([\s\S]+?)<\/p>")

And here is the html I am attempting to match

</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic ...

It doesn't end up getting closed for some time, but the program doesn't grab this section at all, and only after the in

ygen levels</a>, and even blood pressure.</p>

does it begin to grab the html (EDIT: p elements). I guess I am confused by the inconsistencies with different regex engines, and I am trying to figure out when and where to modify my syntax, in this case to grab the entire p element, but also generally. This is my first time posting here, so I may have this formatted incorrectly, but thank you all in advance. Been lurking for a while now.

lhubbard01
  • 13
  • 4
  • 1
    Your explanation is quite obscure. It wouldn't harm if you attached your example on regexr.com (or on regex101.com for that matter). Hint: try to capture tag endings by excluding it first, e.g. `<p[^>]*>`. Also, you have an unclosed opening `(` in the pattern. – marekful Nov 10 '17 at 03:16
  • Why are you trying to parse html with regex...? Probably [not a great idea](https://stackoverflow.com/a/1732454/4799172). – roganjosh Nov 10 '17 at 03:32
  • Ah, so that functions to check if anything else exists first before we close the bracket in the beginning of the pattern? And the expression used here is the same as the one I tested on regexr.com. And thank you for the regex101.com reference, I'll have to check it out. Any other tips to make my post a bit more concrete? – lhubbard01 Nov 10 '17 at 03:33
  • @roganjosh Sorry I guess parse is the wrong word. I meant piece out. I'm trying to capture all elements I want from each line of html. – lhubbard01 Nov 10 '17 at 03:35
  • See the link I gave in my edited comment. Then try `BeautifulSoup` instead unless you have a compelling reason not to. I'm not sure you do in this case. – roganjosh Nov 10 '17 at 03:36
  • 1
    @roganjosh made me crack up. I've used BeautifulSoup before, but I wanted to get more hands on with it. It felt like I was cheating working on that layer of abstraction, and I really wanted to see if I could do what I needed with just my own expressions, but it turns out it was pretty naive. I've also had BeautifulSoup act irregular when I've used it before, but that may have been my fault, so I'll have to give it another shot. Thanks! – lhubbard01 Nov 10 '17 at 03:39
  • @marekful, your `<p[^>]*>` edit to the expression did the trick! Thanks so much friend – lhubbard01 Nov 10 '17 at 03:59

3 Answers

0

Perhaps it's because you don't have a closing parenthesis ) in your regular expression?

Try starting with this, then build it out:

import re

s = """</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic</p>"""

r = re.compile(r"<p>([\w\W ]*)</p>")

a = r.search(s)
print(a.group(1))

Note that you don't have to escape the forward slash.

Dewald Abrie
  • 1,392
  • 9
  • 21
  • Oh my apologies, I'll edit that. There actually is. – lhubbard01 Nov 10 '17 at 03:28
  • I updated the post to give you a working example to get you started. – Dewald Abrie Nov 10 '17 at 03:39
  • Ah thanks. Yeah I ended up trying it again, and it didn't make a notable difference, but I think your expression fits the use case better. I think it ends up being an issue of html being god-awful to run regular expressions over. I also wanted to capture anything inside the p element in a non-greedy fashion. May have been unclear – lhubbard01 Nov 10 '17 at 03:44
  • Cool, hope that gets you sorted. Remember to mark the right answer. – Dewald Abrie Nov 10 '17 at 03:52
  • Thank you! EDIT: And in my previous comment where I said non-greedy fashion, I actually meant greedy. – lhubbard01 Nov 10 '17 at 03:54
0

The expression [\s\S]* will match everything, and so will go straight past the beginning of the tag.

Within the tag, your expression p.* is greedy, and will not stop at the nearest closing bracket. Use .*? for non-greedy.
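To make the two points above concrete, here is a small sketch. The `html` string is a made-up two-paragraph sample (not the article from the question), and the variable names are only for illustration. It compares the leading greedy `[\s\S]*`, the greedy `<p.*>`, and the `<p[^>]*>` form suggested in the comments.

```python
import re

# Hypothetical two-paragraph sample; not the article from the question.
html = ('<p class="lead">First <a href="#">link</a> text.</p>\n'
        '<p>Second paragraph.</p>')

# 1) A leading greedy [\s\S]* swallows the whole input, then backtracks only
#    as far as the LAST "<p", so earlier paragraphs are skipped.
last_only = re.search(r"[\s\S]*<p.*>([\s\S]+?)</p>", html).group(1)

# 2) Greedy .* inside the tag runs to the last ">" on the line, so the
#    capture starts after the first paragraph's closing tag.
greedy_tag_capture = re.search(r"<p.*>([\s\S]+?)</p>", html).group(1)

# 3) [^>]* cannot cross the tag's own ">", and the lazy group stops at the
#    nearest "</p>", so each paragraph is captured separately.
paragraphs = re.findall(r"<p[^>]*>([\s\S]+?)</p>", html)

print(repr(last_only))
print(repr(greedy_tag_capture))
print(paragraphs)
```

Note that `<p[^>]*>` also matches other tags whose names start with `p`, such as `<pre>`; `<p\b[^>]*>` is a slightly stricter variant.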

You seem to have a number of other syntax errors in the regex also. Cut and paste a valid regex.

In general it is much easier and less error-prone to use a proper HTML parsing library, even for quite simple tasks. See for example the parsers in lxml.
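As a rough illustration of the parser route, here is a minimal sketch that uses the standard library's `html.parser` in place of lxml, so it runs without installing anything. It assumes Python 3; the `ParagraphExtractor` class name and its attributes are made up for this example, and the sample string is the closed snippet from the other answer.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False        # True while we are inside a <p> element
        self.paragraphs = []     # finished paragraph texts
        self._chunks = []        # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.paragraphs.append("".join(self._chunks))
            self.in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> also arrives here.
        if self.in_p:
            self._chunks.append(data)


sample = ('</figcaption></figure><p>Researchers at MIT and several other\n'
          'institutions have developed a method for making photonic</p>')

extractor = ParagraphExtractor()
extractor.feed(sample)
extractor.close()
print(extractor.paragraphs)
```

The paragraph spans two lines here, which the parser handles without any special flags, so there is no need to split the response on newlines first.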

pwray
  • 1,075
  • 1
  • 10
  • 19
  • Thank you, I'll have to check that out. I may have been ambiguous about the use case, I was also splitting the html response every newline so I was using a catch all in order to compensate for the inconsistencies I was experiencing between the test engine and my python2 version. The greediness was a bit intended. Thank you for the suggestions though, it can definitely be improved – lhubbard01 Nov 10 '17 at 03:51
0

In this case, I ended up getting the response I desired with @marekful 's expression substituted into the regex mentioned in the post. Thank you all for the assistance!

re.compile(ur"[\s\S]*?<p[^>]*>([\w\W]*)</p>")
lhubbard01
  • 13
  • 4