I'm very new to programming and would really appreciate any help! I am trying to write this little Python script:

I have an .html file of a legal code that is divided into sections (§§), for example http://www.gesetze-im-internet.de/stgb/BJNR001270871.html. I want to write a Python regex script that automatically tags specific §§. The relevant HTML of the document is:

<div class="jnnorm" id="BJNR398310001BJNE000100305" title="Einzelnorm">
<div class="jnheader"> <a name="BJNR398310001BJNE000100305"/>
<a href="index.html#BJNR398310001BJNE000100305">Nichtamtliches Inhaltsverzeichnis</a>
<h3><span class="jnenbez">&#167; 1</span>&#160;<span class="jnentitel"></span></h3> </div>

Here div class="jnnorm" should become div class="jnnorm MYTAGHERE". The text inside the span with class="jnenbez" (&#167; 1) holds the number of the §, here § 1.

I am trying (and failing) to write a script that does the following:

1) Let's say I have a list my_dict = [112, 204] (a list of § numbers, despite the name)

2) Find "<span class="jnenbez">&#167; 112" and "<span class="jnenbez">&#167; 204" in the .htm file

3) Go left from "jnenbez">&#167; 112" to the next "jnnorm" string and replace it with "jnnorm MYTAGHERE".

Here is what I got so far, but I hit a roadblock quite soon.

import re

# open() is used here because the file() builtin only exists in Python 2
with open("filename.htm", "r") as f:
    text = f.read()

my_dict = [1, 123, 200]
# don't know how to find the §
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
# re.sub does not seem to work?
CSᵠ
Elip
    I always like opportunities to link to this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Have you considered using a HTML parser like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)? – Jacob Aug 19 '11 at 09:06
    Or lxml? BeautifulSoup's latest version is old, although it does the job and a new version is along the way. – ustun Aug 19 '11 at 09:13

2 Answers

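Using BeautifulSoup, you can find elements by the value of their class attribute.

from BeautifulSoup import BeautifulSoup
soup.findAll('div', {'class': 'jnnorm'})

will return a list of the div elements whose class attribute is 'jnnorm' (soup being the parsed document). Note that findAll('class') would instead look for tags named class.

The general pattern, for example with this document: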

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')

gives

# [<b>one</b>, <b>two</b>]

Then use a regex, or simply test whether each number in your list appears in the text of one of the returned elements.

This answers steps 1 and 2 of your question.
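
To make steps 1 and 2 concrete, here is a minimal sketch of the "collect the elements, then test membership" idea. It swaps in the standard library's html.parser for BeautifulSoup so that it runs without installing anything; the names ParagraphNumberCollector and find_wanted and the sample markup are made up for illustration, and it assumes the § label always looks like "&#167; 112" with a purely numeric part (labels such as "§ 112a" are skipped).

```python
from html.parser import HTMLParser


class ParagraphNumberCollector(HTMLParser):
    """Collect the text of every <span class="jnenbez"> element."""

    def __init__(self):
        # convert_charrefs=True turns "&#167;" into the "§" character
        super().__init__(convert_charrefs=True)
        self.in_jnenbez = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "jnenbez" in classes:
            self.in_jnenbez = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_jnenbez = False

    def handle_data(self, data):
        if self.in_jnenbez:
            self.labels[-1] += data


def find_wanted(html, wanted):
    """Return the § numbers from `wanted` that occur in `html`."""
    collector = ParagraphNumberCollector()
    collector.feed(html)
    collector.close()
    found = []
    for label in collector.labels:
        # A label looks like "§ 112"; drop the sign and surrounding spaces
        number = label.replace("\u00a7", "").strip()
        if number.isdigit() and int(number) in wanted:
            found.append(int(number))
    return found


# Hypothetical, shortened sample of the markup from the question
sample = (
    '<div class="jnnorm"><h3><span class="jnenbez">&#167; 1</span></h3></div>'
    '<div class="jnnorm"><h3><span class="jnenbez">&#167; 112</span></h3></div>'
)
print(find_wanted(sample, [112, 204]))
```

With BeautifulSoup the collection step would be a single findAll call; the membership test stays the same.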

kiriloff

re.sub doesn't change the string, it returns a new (modified) string instead. If you want the text variable to change you should assign the new value to it:

text = re.sub("jnnorm", "jnnorm MYTAGHERE", text)

Or, more simply (regular expressions are overkill for a plain string replacement):

text = text.replace("jnnorm", "jnnorm MYTAGHERE")
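
A short sketch of why the original call appeared to do nothing: Python strings are immutable, so both re.sub and str.replace return a new string and leave the original alone. The sample string and the variable names are made up for illustration.

```python
import re

text = '<div class="jnnorm" id="a">'

# Calling re.sub and discarding the result leaves text exactly as it was
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
unchanged = text

# Keeping the return value is what actually gives you the modified string
via_regex = re.sub("jnnorm", "jnnorm MYTAGHERE", text)
via_replace = text.replace("jnnorm", "jnnorm MYTAGHERE")

print(unchanged)
print(via_regex)
```

Note that both forms tag every occurrence of jnnorm, not only the §§ you are interested in.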

But for anything more complicated - yes, you should consider using a proper HTML parser.
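
If a parser is not an option, the selective tagging from the question can be sketched with a single regex, with the caveat that this is fragile and tied to this exact markup. The sketch assumes every class="jnnorm" div is followed by exactly one span written literally as class="jnenbez">&#167; N</span>; the function name tag_paragraphs and the sample markup are made up for illustration.

```python
import re


def tag_paragraphs(html, numbers, tag="MYTAGHERE"):
    """Add `tag` to the jnnorm div of each § whose number is in `numbers`."""
    if not numbers:
        return html
    wanted = "|".join(re.escape(str(n)) for n in numbers)
    pattern = re.compile(
        r'class="jnnorm"'
        # Look ahead to the § label, but never across the next jnnorm div
        r'(?=(?:(?!class="jnnorm).)*?'
        r'class="jnenbez">&#167; (?:' + wanted + r')</span>)',
        re.DOTALL,
    )
    return pattern.sub('class="jnnorm ' + tag + '"', html)


# Hypothetical, shortened sample of the markup from the question
sample = (
    '<div class="jnnorm" id="a"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 1</span></h3></div></div>\n'
    '<div class="jnnorm" id="b"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 112</span></h3></div></div>\n'
)
tagged = tag_paragraphs(sample, [112, 204])
print(tagged)
```

The lookahead replaces the "go left" step: instead of searching backwards from the § label, each jnnorm is checked for whether its own label is one of the wanted numbers. Any change in attribute order, quoting or whitespace in the source would break the pattern, which is where a parser becomes the safer choice.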

Wladimir Palant