I am working on transferring old content from a website, written in some old HTML, to their new WordPress site. I am using Python to do this. I am trying to
1. Get the content from the old HTML pages using urllib.request
2. Use a regular expression to grab the text of HTML <p> elements whose classes identify them as the body of the text
3. Use XML-RPC methods to upload the content to the new WordPress site
I'm ok with #1 and #3. The problem I am having is with #2, writing the regular expression to capture the content.
The content is in paragraphs with varying formats. Below are two representative paragraphs whose content I am trying to extract with a regular expression.
Paragraph #1
<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future." So said
bishop-elect H. George Anderson at a news conference immediately following his election as
bishop of the Evangelical Lutheran Church in America. "[The
future] belongs to God, untouched by human hands." At the beginning of a
new ministry of leadership and pastoral oversight, such words from a bishop are
obviously designed to project confidence and a profound sense of trust in the
mission of the Church. They are words designed to inspire and empower the
people of God for ministry.<o:p></o:p></span></p>
Paragraph #2
<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages
ago, another prophet of the people stood at his station and peered into the
future. The<span style="mso-spacerun: yes"> </span>prophet Habakkuk poised on
the rampart, scanned the horizon for the approaching enemy he knew was coming.
As he waited, Habakkuk prayed to God asking why God was unresponsive to all
this violence and destruction. In Habakkuk chapter 2 the prophet records God's
answer to his questions about the future. God says to the fearful one, "For
there is still a vision for the appointed time;… If it seems to tarry, wait for
it; it will surely come, it will not delay…the righteous live by faith"
(2:3-4).<o:p></o:p></span></p>
Ideally my regular expression would identify content paragraphs by their class of BODY or bodyDC. Once it has identified a paragraph containing text content, it would ignore all the HTML elements preceding and following the text content, and simply grab the text content.
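The goal described above can be sketched roughly as follows. This is only an illustration of the intended behavior on markup shaped like the two samples, not a tested solution for the real pages. The names PARAGRAPH_RE, TAG_RE, and extract_body_text are made up for this sketch, and it assumes that no attribute value contains a literal > and that body paragraphs are never nested.

```python
import html
import re

# Hypothetical pattern: a <p> whose class is BODY or bodyDC (quoted or not),
# capturing everything up to the first closing </p>.
PARAGRAPH_RE = re.compile(
    r'<p\s+class=["\']?(BODY|bodyDC)\b[^>]*>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,  # DOTALL lets .*? run across line breaks
)

# Any remaining tag inside the paragraph, e.g. <span ...> or <o:p>.
TAG_RE = re.compile(r'<[^>]+>')


def extract_body_text(page_html):
    """Return the plain text of each BODY/bodyDC paragraph in page_html."""
    texts = []
    for match in PARAGRAPH_RE.finditer(page_html):
        inner = match.group(2)
        text = TAG_RE.sub('', inner)           # drop the inner tags
        text = html.unescape(text)             # decode entities such as &amp;
        texts.append(' '.join(text.split()))   # collapse line breaks and runs of spaces
    return texts


if __name__ == '__main__':
    sample = (
        "<p class=BODY><span style='font-size:14.0pt'>Ages\n"
        "ago, another prophet.<o:p></o:p></span></p>"
    )
    print(extract_body_text(sample))
```

Regular expressions are fragile on real-world HTML, so this sketch should be read as a statement of the desired output rather than a robust parser.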
The regular expression I have so far is still a work in progress:
post_content_re = re.compile(r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])', re.IGNORECASE)
My explanation of each part of the regular expression:
- class=(body\w*) should match either BODY or bodyDC, but it only matches BODY, and I don't know why.
- (.*?>) should match the remaining attributes in the paragraph element.
- (<.*?>)* should match 0 or more HTML elements enclosed in <> after the paragraph element.
- ([a-z]) should match the start of the content I am trying to get, which comes after any HTML elements. For now I am only capturing one letter rather than the full paragraph text, because I'm still testing.
The matches I am getting all look like this:
- BODY: but I expected BODY or bodyDC
- >: this is the closing > of the p element with class BODY
- <span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>: this is the span element after the p element
- A: this is the first letter after the span element
So essentially, my RE is matching paragraphs like Paragraph #2 above, but not like Paragraph #1. I'm not sure why, and I'm stuck.
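A minimal reproduction of the behavior, using shortened copies of the two sample paragraphs with their original line breaks preserved. The helper name first_groups and the second compiled pattern are additions for this sketch. One thing it makes visible, offered as an observation rather than a confirmed diagnosis for the full pages: in Paragraph #1 the span tag is split across two lines, and by default . does not match a newline, so (<.*?>)* cannot consume that tag unless re.DOTALL is set.

```python
import re

# Shortened copies of the two samples; the "\n" marks the original line break.
PARAGRAPH_1 = (
    "<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;\n"
    "mso-bidi-font-size:10.0pt'>We have no need to fear the future.<o:p></o:p></span></p>"
)
PARAGRAPH_2 = (
    "<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages\n"
    "ago, another prophet of the people stood at his station.<o:p></o:p></span></p>"
)

PATTERN = r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])'

# The pattern exactly as in the question.
default_re = re.compile(PATTERN, re.IGNORECASE)
# Same pattern, but "." is also allowed to match newlines.
dotall_re = re.compile(PATTERN, re.IGNORECASE | re.DOTALL)


def first_groups(regex, text):
    """Return the captured groups of the first match, or None if there is no match."""
    match = regex.search(text)
    return match.groups() if match else None


if __name__ == '__main__':
    for name, paragraph in (('Paragraph #1', PARAGRAPH_1), ('Paragraph #2', PARAGRAPH_2)):
        print(name, 'default:', first_groups(default_re, paragraph))
        print(name, 'DOTALL: ', first_groups(dotall_re, paragraph))
```

On these shortened samples the default pattern finds nothing in Paragraph #1 and matches Paragraph #2, while the DOTALL variant matches both. Whether that holds for every paragraph on the real pages is untested.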
Thank you for any help.