Regex to capitalize paragraphs in HTML python

Question

I want to take everything in an HTML document and capitalize the sentences (within paragraph tags). The input file has everything in all caps.

My attempt has two flaws: first, it removes the paragraph tags themselves, and second, it simply lower-cases everything in the match groups. I don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.

There may be a much easier way to do this than regex, too. Here's what I have:

import re

def replace(match):
    return match.group(1).capitalize()

with open('explanation.html', 'rbU') as inf:
    with open('out.html', 'wb') as outf:
        cont = inf.read()
        par = re.compile(r'(?s)\<p(.*?)\<\/p')
        s = re.sub(par, replace, cont)
        outf.write(s)

First, do not use a regex to extract the content of p tags; use BeautifulSoup. – Casimir et Hippolyte, Sep 09 '15 at 18:43
I'm just trying to do something quick and simple for a one-time use. I don't normally touch HTML. – Xodarap777, Sep 09 '15 at 18:44
Never use regex on HTML/XML; a StackOverflow user went insane due to this same situation: http://stackoverflow.com/a/1732454/1066393 – ZaxLofful, Sep 09 '15 at 18:45
About `capitalize()`: it doesn't care what is a sentence or not; it uppercases the first letter of the string and lowercases the rest. A possible way to extract sentences is to use nltk. – Casimir et Hippolyte, Sep 09 '15 at 18:47
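
To make the behaviour described in the comment above concrete: `str.capitalize()` treats the whole string as one unit, so applied to a full paragraph it leaves every sentence after the first in lower case. Below is a minimal sketch using only the standard library. The helper name `capitalize_sentences` and the naive split on `.`, `!` or `?` followed by whitespace are assumptions for illustration, not a substitute for a real sentence tokenizer such as the one in nltk.

```python
import re

text = "THE INPUT FILE HAS EVERYTHING IN ALL CAPS. THIS IS A SECOND SENTENCE."

# capitalize() uppercases the first character of the string and lowercases the rest.
whole = text.capitalize()

def capitalize_sentences(s):
    """Capitalize each chunk that follows sentence-ending punctuation (naive split)."""
    # Split on whitespace that comes right after '.', '!' or '?';
    # the lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', s)
    return " ".join(p.capitalize() for p in parts)

per_sentence = capitalize_sentences(text)
print(whole)
print(per_sentence)
```

The naive split will also break after abbreviations and ellipses, which is one reason a trained tokenizer may be the better choice for real text.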

Casimir et Hippolyte · Accepted Answer · 2015-09-09T19:23:49.303

3

An example with beautifulsoup and nltk:

from nltk.tokenize import PunktSentenceTokenizer
from bs4 import BeautifulSoup

html_doc = '''<html><head><title>abcd</title></head><body>
<p>i want to take everything in an HTML document and capitalize the sentences (within paragraph tags).
the input file has everything in all caps.</p>
<p>my attempt has two flaws - first, it removes the paragraph tags, themselves, and second, it simply lower-cases everything in the match groups.
 i don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.</p>
<p>there may be a much easier way to do this than regex, too. Here's what I have:</p>
</body>
</html>'''

soup = BeautifulSoup(html_doc, 'html.parser')

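for paragraph in soup.find_all('p'):
    text = paragraph.get_text()
    # Train a Punkt tokenizer on this paragraph's text, then split it into sentences
    sent_tokenizer = PunktSentenceTokenizer(text)
    # capitalize() uppercases the first letter of each sentence and lowercases the rest
    sents = [x.capitalize() for x in sent_tokenizer.tokenize(text)]
    # Assigning to .string replaces the paragraph's contents (nested tags are dropped)
    paragraph.string = "\n".join(sents)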

print(soup)

edited Sep 09 '15 at 19:23

answered Sep 09 '15 at 19:14

Casimir et Hippolyte

How do I write that to a string? It says it must be a buffer, not BeautifulSoup. – Xodarap777 Sep 09 '15 at 20:20
To target a specific location (the paragraph before, after, inside... with attribute x, y or z), you can use CSS selectors with beautifulsoup (see the doc). Another possible way is to use the `lxml` module instead of `beautifulsoup`, which allows XPath queries. – Casimir et Hippolyte Sep 09 '15 at 20:30
If you want the final result as a string, you only need `str(soup)` – Casimir et Hippolyte Sep 09 '15 at 20:39
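
The "must be a buffer" error mentioned above most likely comes from passing the BeautifulSoup object directly to `write()`. Here is a minimal sketch of converting it with `str(soup)` and saving the result. The helper name `save_soup` and the UTF-8 encoding are assumptions, and the file is opened in text mode (Python 3) rather than `'wb'` as in the question.

```python
from bs4 import BeautifulSoup

def save_soup(soup, path, encoding="utf-8"):
    """Serialize a BeautifulSoup tree to text and write it to `path`."""
    html_text = str(soup)  # write() needs a string, not the soup object
    with open(path, "w", encoding=encoding) as outf:
        outf.write(html_text)
    return html_text

# Tiny stand-in document; the real input would be the parsed HTML file.
soup = BeautifulSoup("<p>hello world.</p>", "html.parser")
soup.p.string = soup.p.get_text().capitalize()
html_text = str(soup)
print(html_text)
```

Calling `save_soup(soup, 'out.html')` would then write the modified markup to the output file named in the question.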
