How to use regular expressions on a .html file in python?

Question

I'm very new to programming and I would really appreciate any help! I am trying to write this little python script:

I have an .html file of a legal codification in §§. (For example: http://www.gesetze-im-internet.de/stgb/BJNR001270871.html) Now I want to write a python regex script to automatically tag specific §§. The relevant html code of the document is:

"<div class="jnnorm" id="BJNR398310001BJNE000100305" title="Einzelnorm"><div 

class="jnheader"> <a name="BJNR398310001BJNE000100305"/><a 

href="index.html#BJNR398310001BJNE000100305">Nichtamtliches    Inhaltsverzeichnis</a>h3><span 

class="jnenbez">&#167; 1</span>&#160;<span class="jnentitel"></span></h3> </div>"

Here class="jnnorm" should become class="jnnorm MYTAGHERE". The text inside <span class="jnenbez"> contains the number of the §, here § 1.

I am trying (and failing) to write a script that does the following:

1) Let's say I have a list my_dict = [112, 204]

2) Find "<span class="jnenbez">§ 112" and "<span class="jnenbez">§ 204" in the .htm file

3) Go left (backwards) from "jnenbez">§ 112" to the nearest preceding "jnnorm" string and replace it with "jnnorm MYTAGHERE".

Here is what I got so far, but I hit a roadblock quite soon.

import re

f = open("filename.htm", "r")
text = f.read()
f.close()
my_dict = [1, 123, 200]
# don't know how to find the §
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
# re.sub does not seem to work?

I always like opportunities to link to this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Have you considered using a HTML parser like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)? — Jacob, Aug 19 '11 at 09:06
Or lxml? BeautifulSoup's latest version is old, although it does the job and a new version is along the way. — ustun, Aug 19 '11 at 09:13

score 0 · Answer 1 · answered Sep 29 '13 at 13:23

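Using BeautifulSoup, you can find elements by the value of their class attribute.

from BeautifulSoup import BeautifulSoup
# soup is a parsed document, e.g. soup = BeautifulSoup(text)
soup.findAll('span', {'class': 'jnenbez'})

will return a list of the span elements whose class is 'jnenbez'.

For example, with this document: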

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')

gives

# [<b>one</b>, <b>two</b>]

Then use a regex, or simply test whether each number in your list appears in the text of one of the returned elements.

This covers steps 1) and 2) from your question.
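For readers without BeautifulSoup installed, the same find step (1 and 2 in the question) can be sketched with the html.parser module from the Python 3 standard library. This swaps in a different parser and is not the BeautifulSoup API used in this answer. The class name NormFinder, the sample markup, and its ids are made up for illustration.

```python
from html.parser import HTMLParser


class NormFinder(HTMLParser):
    """Collect ids of jnnorm divs whose jnenbez span names a wanted paragraph."""

    def __init__(self, wanted):
        super().__init__()        # by default, &#167; arrives as "§" in handle_data
        self.wanted = set(wanted)
        self.current_id = None    # id of the most recently opened jnnorm div
        self.in_bez = False       # True while inside <span class="jnenbez">
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "jnnorm" in classes:
            self.current_id = attrs.get("id")
        elif tag == "span" and "jnenbez" in classes:
            self.in_bez = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_bez = False

    def handle_data(self, data):
        if self.in_bez:
            label = data.replace("§", "").strip()
            # only plain numbers count, so "112a" is skipped
            if label.isdecimal() and int(label) in self.wanted:
                self.found.append(self.current_id)


# Hypothetical sample shaped like the snippet in the question
sample = (
    '<div class="jnnorm" id="norm-1"><h3><span class="jnenbez">&#167; 1</span></h3></div>'
    '<div class="jnnorm" id="norm-112"><h3><span class="jnenbez">&#167; 112</span></h3></div>'
)

finder = NormFinder([112, 204])
finder.feed(sample)
finder.close()
print(finder.found)
```

The collected ids identify which jnnorm divs need the extra class; how to write the change back is left open here.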

score 0 · Accepted Answer · answered Aug 19 '11 at 09:35

re.sub doesn't change the string, it returns a new (modified) string instead. If you want the text variable to change you should assign the new value to it:

text = re.sub("jnnorm", "jnnorm MYTAGHERE", text)

Or simpler (given that regular expressions are overkill for a plain string replacement):

text = text.replace("jnnorm", "jnnorm MYTAGHERE")

But for anything more complicated - yes, you should consider using a proper HTML parser.
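Note that the one-line replacements above tag every jnnorm in the file, not just selected §§. Below is a minimal Python 3 sketch of tagging only the wanted ones. It assumes every norm opens with class="jnnorm" exactly as in the question's snippet, and that its number sits in a <span class="jnenbez"> before the next norm starts. The names tag_paragraphs and SECTION_RE and the sample markup are hypothetical.

```python
import re

# Matches a jnenbez span that holds "§" (or its entity) and a plain number
SECTION_RE = re.compile(r'<span class="jnenbez">(?:&#167;|§)\s*(\d+)\s*</span>')

MARKER = 'class="jnnorm"'


def tag_paragraphs(html, wanted, tag="MYTAGHERE"):
    """Add `tag` to the class of each jnnorm div whose number is in `wanted`."""
    parts = html.split(MARKER)
    out = [parts[0]]                      # text before the first jnnorm
    for chunk in parts[1:]:
        # chunk runs from just after one jnnorm marker up to the next one
        match = SECTION_RE.search(chunk)
        if match and int(match.group(1)) in wanted:
            out.append('class="jnnorm %s"' % tag)
        else:
            out.append(MARKER)            # leave this norm unchanged
        out.append(chunk)
    return "".join(out)


# Hypothetical sample shaped like the snippet in the question
sample = (
    '<div class="jnnorm" id="a"><h3><span class="jnenbez">&#167; 1</span></h3></div>'
    '<div class="jnnorm" id="b"><h3><span class="jnenbez">&#167; 112</span></h3></div>'
)

result = tag_paragraphs(sample, {112, 204})
print(result)
```

This relies on string splitting plus one regex, so it breaks as soon as the attribute order or quoting changes. That fragility is the reason a parser is the safer choice for anything beyond this.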

answered Aug 19 '11 at 09:35

Wladimir Palant

Strings are immutable, so it's a given it doesn't change the string. – Cassandra S. Aug 19 '11 at 09:59
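The point in the comment above can be checked directly. This is a small Python 3 sketch with a made-up one-line input:

```python
import re

text = '<div class="jnnorm" id="x">'
returned = re.sub("jnnorm", "jnnorm MYTAGHERE", text)

# text keeps its old value; only the returned string carries the replacement
print(text)
print(returned)
```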
