I've been working on a web crawler to parse readable content out of a news site I like, and I've been using regex pretty heavily in Python 2. I visited https://regexr.com/ to double-check that I had the correct expression for this use case, but the results I get in Python keep differing from what regexr shows. Here is the expression:
re.compile(ur"[\s\S\]*<p.*>([\s\S]+?)<\/p>")
And here is the HTML I am attempting to match:
</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic ...
It doesn't end up getting closed for some time, but the program doesn't grab this section at all, and only after the in
ygen levels</a>, and even blood pressure.</p>
does it begin to grab the HTML (EDIT: p elements). I guess I am confused by the inconsistencies between different regex engines, and I am trying to figure out when and where to modify my syntax, both in this case, to grab the entire p element, and in general. This is my first time posting here, so I may have formatted this incorrectly, but thank you all in advance. Been lurking for a while now.
... `<p[^>]*>`. Also, you have an unclosed opening `(` in the pattern.
– marekful Nov 10 '17 at 03:16
The `<p[^>]*>` edit to the expression did the trick! Thanks so much friend
– lhubbard01 Nov 10 '17 at 03:59
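For anyone landing here later, a minimal sketch of how a greedy `.*` inside the opening tag behaves differently from a negated class `[^>]*`. The sample string is hypothetical (a one-line stand-in modeled on the snippet in the question, not the real page), and the leading `[\s\S]*` is dropped because `findall` already scans the whole input. Plain `r"..."` literals are used so the snippet should run on both Python 2 and Python 3.

```python
import re

# Hypothetical one-line sample modeled on the question's snippet (not the real page).
html = ('</figcaption></figure><p>Researchers at MIT have developed a method '
        'to track <a href="#">oxygen levels</a>, and even blood pressure.</p>')

# Greedy `.*` runs to the last `>` that still lets the rest of the pattern match,
# so the "opening tag" part swallows most of the paragraph text.
greedy = re.compile(r"<p.*>([\s\S]+?)</p>")

# `[^>]*` cannot cross a `>`, so the match stops at the end of the opening tag.
fixed = re.compile(r"<p[^>]*>([\s\S]+?)</p>")

greedy_result = greedy.findall(html)
fixed_result = fixed.findall(html)

print(greedy_result)  # only the tail after the last inner `>`
print(fixed_result)   # the full paragraph body
```

Note that `<p[^>]*>` will also match tags such as `<pre>`; adding a word boundary (`<p\b[^>]*>`) is one way to narrow it, and for anything beyond quick scraping an HTML parser is usually the safer tool.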