
I have some text like

<br />
blah
<br />
blah blah

which I'm trying to change to:

<p>
blah
</p>
<p>
blah blah
</p>

I have the following regex

newContent = re.sub("<br />(?=(.*(<br />)?\n)<br />)","<p>",newContent)

But this doesn't work the way I want. I want whatever comes before the lookahead to be replaced with <p> and whatever comes after the lookahead to be replaced with </p>.

Is this possible?

Martyn
  • Word of warning: [Regex is not a tool that can be used to correctly parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jbabey Oct 25 '13 at 13:48
  • Another day. Another chance to post this explanation: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Cfreak Oct 25 '13 at 13:50
  • This question appears to be off-topic because it is [about parsing HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)! –  Oct 25 '13 at 13:59
  • Act as if I didn't give you a [regex solution](http://regex101.com/r/sV1mL5) :P `<br />(.*?)(?=<br />|$)`, replace with `<p>\1</p>` – HamZa Oct 25 '13 at 14:06

3 Answers

Listen to the people who suggest using an HTML parser, for example:

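from bs4 import BeautifulSoup

# 'htmlfile' holds the <br />-separated text from the question
with open('htmlfile', 'r') as f:
    soup = BeautifulSoup(f, 'html')

for br in soup.find_all('br'):
    # Move the text node that follows each <br /> into a new <p>
    p = soup.new_tag('p')
    p.string = br.next_sibling.extract()
    # Swap the <br /> for the filled paragraph
    br.replace_with(p)

print(soup.prettify())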

Run it like:

python3 script.py

That yields:

<html>
 <body>
  <p>
   blah
  </p>
  <p>
   blah blah
  </p>
 </body>
</html>
Birei

You can't do that with a lookahead alone, because a regex substitution only replaces the matched text in place; it cannot carry a result over to a later position. What you can do is use a workaround like this one:

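s = "html code"  # placeholder for the input text
# Drop empty pieces, e.g. the one before a leading "<br />"
parts = [part for part in s.split("<br />") if part.strip()]
s = "<p>" + "</p><p>".join(parts) + "</p>"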
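
The split-and-join idea can also be wrapped in a small function that reproduces the exact layout asked for in the question. This is only a minimal sketch: the name `br_to_paragraphs` and its `sep` parameter are hypothetical, and it assumes the input is plain text separated by literal `<br />` markers rather than arbitrary HTML.

```python
def br_to_paragraphs(text, sep="<br />"):
    """Hypothetical helper: wrap each sep-delimited chunk in <p>...</p>."""
    # Split on the separator and trim surrounding whitespace from each chunk
    chunks = [chunk.strip() for chunk in text.split(sep)]
    # Skip empty chunks (such as the one before a leading separator)
    return "\n".join(f"<p>\n{chunk}\n</p>" for chunk in chunks if chunk)


source = "<br />\nblah\n<br />\nblah blah\n"
result = br_to_paragraphs(source)
print(result)
```

Stripping each chunk keeps stray newlines around the text from leaking into the paragraphs.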
Roman Dobrovenskii

This is a simple regular expression; there is no need for splitting or BeautifulSoup.

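import re

# The pattern hard-codes the literal text, so it only fits this exact input
t = '(.+)(blah)(.+)(blah blah)'
# Replacement template: groups 2 and 4 hold the two text lines
r = r"""<p>
\2
</p>
<p>
\4
</p>
"""
s = """<br />
blah
<br />
blah blah
"""
# re.S lets '.' match newlines so a single match spans the whole input
print(re.sub(t, r, s, flags=re.S))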

It gives

<p>
blah
</p>
<p>
blah blah
</p>
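
For input whose text is not known in advance, a capture group combined with a lookahead can do the same job. The following is a sketch under assumptions, not a general HTML solution: it assumes every paragraph starts with a literal `<br />` and runs up to the next `<br />` or the end of the string, and the names `BR_BLOCK` and `br_to_p` are made up for illustration.

```python
import re

# Capture the text after each <br />, stopping (without consuming) at the
# next <br /> or at the end of the string. re.S lets '.' span newlines.
BR_BLOCK = re.compile(r"<br />\s*(.*?)\s*(?=<br />|\Z)", flags=re.S)


def br_to_p(text):
    """Hypothetical helper: rewrite each <br />-led chunk as a <p> block."""
    return BR_BLOCK.sub(r"<p>\n\1\n</p>\n", text)


sample = "<br />\nblah\n<br />\nblah blah\n"
converted = br_to_p(sample)
print(converted, end="")
```

Because the lookahead does not consume the following `<br />`, that tag is still available to start the next match.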
peroksid