
I want to take everything in an HTML document and capitalize the sentences (within paragraph tags). The input file has everything in all caps.

My attempt has two flaws: first, it removes the paragraph tags themselves, and second, it simply lower-cases everything in the match groups. I don't quite know how capitalize() works, but I assumed that it would leave the first letter of each sentence capitalized.
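
A minimal sketch of what `str.capitalize()` does, which may explain the lower-casing described above. The sample strings are made up for illustration; the second one assumes the captured group starts with the `>` that closes the opening `<p>` tag, as it would with a pattern that captures everything right after `<p`.

```python
# str.capitalize() upper-cases only the FIRST character of the whole string
# and lower-cases every character after it. It knows nothing about sentences.
plain = "FIRST SENTENCE. SECOND SENTENCE."
result_plain = plain.capitalize()
print(result_plain)      # First sentence. second sentence.

# If the first character is not a letter (here the '>' left over from "<p>"),
# nothing is upper-cased at all, so the result looks simply lower-cased.
captured = ">FIRST SENTENCE. SECOND SENTENCE."
result_captured = captured.capitalize()
print(result_captured)   # >first sentence. second sentence.
```

So each sentence has to be split out and capitalized separately; calling `capitalize()` once per paragraph cannot do it.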

There may be a much easier way to do this than regex, too. Here's what I have:

import re

def replace(match):
    return match.group(1).capitalize()

with open('explanation.html', 'rbU') as inf:
    with open('out.html', 'wb') as outf:
        cont = inf.read()
        par = re.compile(r'(?s)\<p(.*?)\<\/p')
        s = re.sub(par, replace, cont)
        outf.write(s)
Xodarap777

1 Answer


An example with BeautifulSoup and NLTK:

from nltk.tokenize import PunktSentenceTokenizer
from bs4 import BeautifulSoup

html_doc = '''<html><head><title>abcd</title></head><body>
<p>i want to take everything in an HTML document and capitalize the sentences (within paragraph tags).
the input file has everything in all caps.</p>
<p>my attempt has two flaws - first, it removes the paragraph tags, themselves, and second, it simply lower-cases everything in the match groups.
 i don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.</p>
<p>there may be a much easier way to do this than regex, too. Here's what I have:</p>
</body>
</html>'''

soup = BeautifulSoup(html_doc, 'html.parser')

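for paragraph in soup.find_all('p'):
    text = paragraph.get_text()
    # Train an unsupervised Punkt model on this paragraph's own text
    sent_tokenizer = PunktSentenceTokenizer(text)
    # capitalize() upper-cases the first character and lower-cases the rest
    sents = [x.capitalize() for x in sent_tokenizer.tokenize(text)]
    # Assigning .string replaces the paragraph's contents, nested tags included
    paragraph.string = "\n".join(sents)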

print(soup)
Casimir et Hippolyte
  • How do I write that to a string? It says it must be a buffer, not BeautifulSoup. – Xodarap777 Sep 09 '15 at 20:20
  • To target a specific location (the paragraph before, after, or inside an element, or one with attribute x, y or z), you can use CSS selectors with BeautifulSoup (see the docs). Another option is the `lxml` module, which lets you use XPath queries instead. – Casimir et Hippolyte Sep 09 '15 at 20:30
  • If you want the final result as a string, you only need `str(soup)` – Casimir et Hippolyte Sep 09 '15 at 20:39
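
For comparison, the regex attempt from the question can be repaired with the standard library alone. This is a rough sketch, not a replacement for a real HTML parser or sentence tokenizer: it assumes paragraphs contain plain text with no nested tags, and it treats any `.`, `!` or `?` followed by whitespace as a sentence boundary. The names `capitalize_sentences`, `fix_paragraph` and `capitalize_html` are made up for illustration.

```python
import re

# Capture the opening tag, the body and the closing tag separately,
# so the tags can be written back unchanged.
PARAGRAPH = re.compile(r'(?is)(<p\b[^>]*>)(.*?)(</p>)')

# Naive sentence start: the beginning of the text, or ., ! or ? plus whitespace.
SENTENCE_START = re.compile(r'(^\s*|[.!?]\s+)([a-z])')


def capitalize_sentences(text):
    """Lower-case the text, then upper-case the first letter of each sentence."""
    return SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(),
        text.lower(),
    )


def fix_paragraph(match):
    """Rebuild one <p>...</p> element with its body re-capitalized."""
    opening, body, closing = match.groups()
    return opening + capitalize_sentences(body) + closing


def capitalize_html(html):
    """Apply sentence capitalization inside every <p> element only."""
    return PARAGRAPH.sub(fix_paragraph, html)


sample = "<h1>TITLE</h1>\n<p>FIRST SENTENCE. SECOND ONE!\nTHIRD?</p>"
result = capitalize_html(sample)
print(result)
```

Text outside the paragraph tags is left untouched. Like `capitalize()`, this lower-cases words such as "I" and acronyms, and it will wrongly capitalize the word after an ellipsis, which is where a trained tokenizer such as Punkt does better.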