Python regular expression grabbing paragraphs from old HTML

Question

I am working on transferring old content from a website, written in some old HTML, to their new WordPress site. I am using Python to do this. I am trying to:

1. Get the content from the old HTML pages using urllib.request.
2. Use a regular expression to grab the text of HTML <p> elements whose classes identify them as the body of the text.
3. Use XML-RPC methods to upload the content to the new WordPress site.

I'm ok with #1 and #3. The problem I am having is with #2, writing the regular expression to capture the content.

The content is in paragraphs whose format varies. Below are two representative paragraphs whose content I am trying to extract with a regular expression.

Paragraph #1

<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future.&quot; So said
bishop-elect H. George Anderson at a news conference immediately following his election as 
bishop of the Evangelical Lutheran Church in America. &quot;[The
future] belongs to God, untouched by human hands.&quot; At the beginning of a
new ministry of leadership and pastoral oversight, such words from a bishop are
obviously designed to project confidence and a profound sense of trust in the
mission of the Church. They are words designed to inspire and empower the
people of God for ministry.<o:p></o:p></span></p>

Paragraph #2

<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages
ago, another prophet of the people stood at his station and peered into the
future. The<span style="mso-spacerun: yes">  </span>prophet Habakkuk poised on
the rampart, scanned the horizon for the approaching enemy he knew was coming.
As he waited, Habakkuk prayed to God asking why God was unresponsive to all
this violence and destruction. In Habakkuk chapter 2 the prophet records God's
answer to his questions about the future. God says to the fearful one, &quot;For
there is still a vision for the appointed time;… If it seems to tarry, wait for
it; it will surely come, it will not delay…the righteous live by faith&quot;
(2:3-4).<o:p></o:p></span></p>

Ideally my regular expression would identify content paragraphs by their class of BODY or bodyDC. Once it has identified a paragraph containing text content, it would ignore all the HTML elements preceding and following the text content, and simply grab the text content.

The regular expression I have so far is still a work in progress:

post_content_re = re.compile(r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])', re.IGNORECASE)

My explanation of the parts of my regular expression:

class=(body\w*) should match either BODY or bodyDC, but it only matches BODY, and I don't know why

(.*?>) match the remaining attributes in the paragraph element

(<.*?>)* match 0 or more html elements enclosed in <> after the paragraph element

([a-z]) The content I am trying to get would be after any HTML elements. Right now I'm just testing for one letter, not the full paragraph text, because I'm still testing.

The matches I am getting all look like this:

BODY - but I expected BODY or bodyDC
> - this is the closing > of the p element with class BODY
<span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'> - this is the span element after the p element
A - this is the first letter after the span element

So essentially, my RE is matching paragraphs like Paragraph #2 above, but not like Paragraph #1. I'm not sure why, and I'm stuck.

Thank you for any help.

Please have a look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ;-) — Klaus-Dieter Warzecha, May 14 '16 at 19:12
This would not be difficult if you used BeautifulSoup or LXML - there are a number of questions already addressing this problem - I would scan those for a solution. — PyNEwbie, May 14 '16 at 20:10

kuruczgyurci · Answer 1 · 2016-05-15T16:55:26.040

While (as someone commented) you should not parse HTML like this, for this one-off job this kind of solution might just work.

Your regex is not working for the first paragraph because . does not match newlines, and you have a newline inside your tag. You can use tricks like [\S\s] to match all characters, including newlines.
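A minimal sketch of the difference, using a hypothetical two-line tag modeled on the span in Paragraph #1 (the variable name `tag` is illustrative only). It also shows the re.DOTALL flag, which is another way to let . match newlines:

```python
import re

# A tag that is split across two lines, as in Paragraph #1.
tag = "<span style='font-size:14.0pt;\nmso-bidi-font-size:10.0pt'>"

# By default "." stops at a newline, so the whole tag cannot be matched.
print(re.fullmatch(r"<.*?>", tag))                    # None

# A class covering whitespace and non-whitespace matches any character.
print(bool(re.fullmatch(r"<[\s\S]*?>", tag)))         # True

# Alternatively, re.DOTALL makes "." match newlines as well.
print(bool(re.fullmatch(r"<.*?>", tag, re.DOTALL)))   # True
```

Either approach should work here; the character class keeps the change local to one part of the pattern, while the flag affects every . in it.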

This one does not remove the tags at the end of the paragraph, but I hope it still helps:

import re

# str1 is assumed to hold the HTML source of the page
for g1, g2, content in re.findall("<p (class=bodyDC|class=BODY)[^><]*>(<[\\S\\s]*?>)*([\\S\\s]*?)<\\/p>", str1):
    print(content)

Bit of explanation:

<p (class=bodyDC|class=BODY)[^><]*> matches the opening paragraph tag
<p: the beginning of the tag
(class=bodyDC|class=BODY): one of the two class attributes
[^><]*: any other attributes inside the tag
>: the end of the tag

(<[\S\s]*?>)* matches any number of tags
<: the beginning of the tag
[\S\s]*?: any other attributes (could have also used [^><]*)
>: end of tag

([\S\s]*?) matches any text. This is group 3, this is basically the content. (Plus the tags at the end of it.)

<\/p> matches the closing paragraph tag. (Note that in the code it actually appears as <\\/p>, because the backslash has to be escaped in the python string.)
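To illustrate the escaping note, here is a small sketch comparing three ways of writing the closing-tag pattern. The variable names are hypothetical. It relies on the fact that / has no special meaning in Python's re module, so the backslash in front of it is optional:

```python
import re

escaped = "<\\/p>"   # normal string: the backslash is doubled
raw = r"<\/p>"       # raw string: the backslash is written once
plain = "</p>"       # no escape at all; "/" is an ordinary character

# The first two literals produce exactly the same pattern text.
print(escaped == raw)  # True

# All three patterns find the closing tag.
for pattern in (escaped, raw, plain):
    print(bool(re.search(pattern, "some text</p>")))  # True
```

A raw string is usually the more readable choice for regular expressions, since sequences such as \S and \s can then be written without doubling.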

In paragraph 1, the `<span>` tag contains a line break right after the semicolon. In paragraph 2, this exact same tag is on a single line. — kuruczgyurci, May 15 '16 at 15:28
Dang I don't know how I missed that. Thanks. I'm still trying to understand your regex. — LeonardShelby, May 15 '16 at 16:17
If I used a raw string (`r'string'`) for your pattern, would I not need to double-escape the `/`? As you have it now, it is <\\/p>. With a raw string, would it be <\/p> with one escape? Or just with no escapes? — LeonardShelby, May 16 '16 at 01:05

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

I would follow a two-step approach to this problem:

First, collect all the paragraphs of interest.
Second, extract the text from each paragraph.

First

Parse out all the paragraphs that have the desired class.

<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)

With the case-insensitive flag set (re.IGNORECASE in Python), this regex will do the following:

find all the paragraph tags of the given class, up to but not including the closing </p>
avoid some odd edge-case problems, such as quoted attribute values that contain a >
due to regex limitations, this will not work with nested paragraph tags like <p>outside paragraph<p>inside paragraph</p>more text in the outside</p>
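A sketch of how this first step might look in Python. The names `PARAGRAPH_RE`, `html_text`, and `paragraphs` are hypothetical, and the sample HTML is a shortened stand-in for the real pages. The pattern is the one above, split across three raw-string literals only for readability:

```python
import re

# The pattern from this answer, compiled case-insensitively so that
# "body|bodydc" also matches BODY and bodyDC as written in the pages.
PARAGRAPH_RE = re.compile(
    r"""<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)"""
    r"""(?:body|bodydc)\1(?:\s|>)"""
    r"""(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)""",
    re.IGNORECASE,
)

# Hypothetical sample: two body paragraphs and one unrelated paragraph.
html_text = (
    "<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;\n"
    "mso-bidi-font-size:10.0pt'>First paragraph.<o:p></o:p></span></p>\n"
    "<p class=footer>Not body text.</p>\n"
    "<p class=BODY><span style='font-size:14.0pt'>Second paragraph.</span></p>"
)

# group(0) is the whole paragraph, from "<p" up to (not including) "</p>".
paragraphs = [m.group(0) for m in PARAGRAPH_RE.finditer(html_text)]
print(len(paragraphs))  # 2
```

Each entry in `paragraphs` still contains its markup; removing it is the job of the second step.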


Second

Extract the raw text from each paragraph

(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)

This regex will do the following:

match both the raw text and tags
place the raw text into capture group 1
avoid difficult edge cases
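A sketch of the second step, assuming Python 3.7 or later (earlier versions treat the empty matches this pattern produces differently in finditer). The names `PIECE_RE`, `paragraph_text`, and `sample` are hypothetical, and the html.unescape call for entities such as &quot; is an addition that is not part of the regex itself:

```python
import html
import re

# The second pattern from this answer: group 1 holds a run of raw text,
# or is None when the other branch matched a tag instead.
PIECE_RE = re.compile(
    r"""(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)"""
)

def paragraph_text(paragraph):
    """Join the text runs, decode entities, and collapse whitespace."""
    runs = (m.group(1) for m in PIECE_RE.finditer(paragraph))
    text = "".join(run for run in runs if run)  # skip None and empty runs
    return " ".join(html.unescape(text).split())

# Hypothetical input: one paragraph as produced by the first step.
sample = (
    "<p class=BODY><span style='font-size:14.0pt'>The"
    "<span style=\"mso-spacerun: yes\">  </span>prophet said, &quot;wait\n"
    "for it&quot;.<o:p></o:p></span>"
)
print(paragraph_text(sample))  # The prophet said, "wait for it".
```

Collapsing whitespace also removes the line breaks that the old HTML has in the middle of sentences, which is probably what you want before uploading the text.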
