How to strip all attributes from an HTML td tag but rowspan in python?

Question

Using Python 3.3, I'm trying to do some regular expression substitutions, so far without success.

I want to strip all attributes from the td tags except the rowspan attribute (example td tags are at the end).

Using the following command, I can substitute successfully when rowspan exists:

re.sub('(<td)[^>]*([\\s]rowspan[\\s]*=[\\s]*[0-9]*)[^>]*(>)', handle_td, file_contents)

where handle_td is:

def handle_td(matchobj):
    new_td = ''
    for curr_group in matchobj.groups(''):
        if curr_group != '':
            new_td += curr_group
    return new_td

But I also want to handle the rest of the td tags, and this I didn't manage.

If I add ? after the second group, it changes the td tag to <td> and does not keep the rowspan attribute.

What am I doing wrong? How can I fix this?

I don't mind running another command to handle the other td tags, but I didn't manage that either...

<td width=307 valign=top style='width:230.3pt;border:solid windowtext 1.0pt; border-left:none;padding:0cm 5.4pt 0cm 5.4pt'>
<td width=307 rowspan=4 style='width:230.3pt;border:solid windowtext 1.0pt; border-top:none;padding:0cm 5.4pt 0cm 5.4pt'>
<td width=307 valign=top style='width:230.3pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt'>

This should produce:

<td>
<td rowspan=4>
<td>

I managed it this way (if you have a better way, feel free to add it):

import re

# Leave only specific attributes for td tags
def filter_td_attributes(matchobj):
    if matchobj.group(1) == "rowspan":
        return matchobj.group(1) + '=' + matchobj.group(2)
    return ''  # drop every other attribute

# Loop over the attributes of the td tag
def handle_td(matchobj):
    new_td = re.sub("([a-zA-Z]+)[\\s]*=[\\s]*([a-zA-Z0-9:;.\\-'\\s]*)([\\s]|>)", filter_td_attributes, matchobj.group(0))
    new_td = re.sub("[\\s]*$", '', new_td)
    new_td = new_td + ">" # close the td tag
    return new_td

file_contents = re.sub('[\\s]*</p>[\\s]*</td>', '</td>', file_contents)
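A possible simplification of the same two-step idea (match the whole td tag first, then look for rowspan inside it) is sketched below. The names TD_TAG, ROWSPAN, strip_td_attributes and clean_td_tags are made up for this illustration, and the sketch assumes rowspan values are plain digits, optionally quoted, and that the text "rowspan=" never appears inside another attribute's value.

```python
import re

# Hypothetical helper patterns, not part of the code above.
TD_TAG = re.compile(r'<td\b[^>]*>', re.IGNORECASE)                 # a whole opening td tag
ROWSPAN = re.compile(r'\srowspan\s*=\s*["\']?(\d+)', re.IGNORECASE)  # rowspan and its digits

def strip_td_attributes(matchobj):
    """Rebuild one td tag, keeping rowspan only if it is present."""
    rowspan = ROWSPAN.search(matchobj.group(0))
    if rowspan:
        return '<td rowspan={}>'.format(rowspan.group(1))
    return '<td>'

def clean_td_tags(html):
    """Apply strip_td_attributes to every opening td tag in html."""
    return TD_TAG.sub(strip_td_attributes, html)

samples = [
    "<td width=307 valign=top style='border-left:none;padding:0cm 5.4pt'>",
    "<td width=307 rowspan=4 style='border-top:none;padding:0cm 5.4pt'>",
]
for tag in samples:
    print(clean_td_tags(tag))
```

Because the tag is rebuilt from scratch rather than edited in place, a td tag that has no attributes at all passes through unchanged.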

Don't parse HTML/XML with regex. Use a parser, it will be *so* much easier. – kreativitea, Dec 05 '12 at 22:28
Obligatory reference: http://stackoverflow.com/a/1732454/1350899 – mata, Dec 05 '12 at 22:31
Use [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) instead. – will, Dec 05 '12 at 23:52
I understand what you're saying, but it's overkill for what I need... I'm not doing full parsing, just stripping some very specific things. – SimonW, Dec 06 '12 at 08:34
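For comparison, the parser route suggested in the comments does not strictly need a third-party package. The following is only a rough sketch using the standard library's html.parser (it assumes Python 3.4 or later for the convert_charrefs argument); the class name TdAttributeStripper and the function strip_td_attributes are invented for this example. It re-emits the markup it sees and is not guaranteed to be a byte-for-byte round trip for unusual input such as processing instructions.

```python
from html.parser import HTMLParser

class TdAttributeStripper(HTMLParser):
    """Re-emit HTML, except that td start tags keep only rowspan."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through untouched.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            kept = ''.join(' rowspan={}'.format(value)
                           for name, value in attrs
                           if name == 'rowspan' and value)
            self.parts.append('<td{}>'.format(kept))
        else:
            # Any other tag is copied exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.parts.append('<!--{}-->'.format(data))

    def handle_decl(self, decl):
        self.parts.append('<!{}>'.format(decl))

def strip_td_attributes(html):
    parser = TdAttributeStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

print(strip_td_attributes("<tr><td width=307 rowspan=4 style='a:b'>x</td></tr>"))
```

Whether this is worth the extra code over a regex depends on how regular the input HTML is.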

score 0 · Answer 1 · answered Dec 05 '12 at 22:39

You have to make the [^>]* part of the pattern non-greedy when the rowspan part is optional: make it [^>]*?. Altogether it becomes:

'(<td)[^>]*?([\\s]rowspan[\\s]*=[\\s]*[0-9]*)?[^>]*(>)'

The greedy version ([^>]*) means "give me as many non ">" characters as possible, but I will accept zero".

The non-greedy version ([^>]*?) means "give me as few non-">" characters as possible while still making the whole regex match".
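One caveat seems worth checking when trying this out: because the rowspan group is optional, the lazy [^>]*? can stop at zero characters, the group can then match nothing, and the trailing [^>]* consumes the rest of the tag. On the sample tag from the question, where rowspan is not the first attribute, that leaves the group empty. The sketch below compares the pattern above (written with \s in a raw string, which is equivalent to [\\s]) with a hypothetical variant, not taken from the answer, that moves the lazy prefix inside an optional non-capturing group.

```python
import re

def handle_td(matchobj):
    # Join the captured groups, treating an unmatched group as ''.
    return ''.join(matchobj.groups(''))

sample = "<td width=307 rowspan=4 style='border-top:none'>"

# Equivalent to the pattern above: lazy prefix, then an optional rowspan group.
lazy = r'(<td)[^>]*?(\srowspan\s*=\s*[0-9]*)?[^>]*(>)'

# Hypothetical variant: the lazy prefix sits inside the optional part, so the
# engine has to search for rowspan before it may skip the group.
wrapped = r'(<td)(?:[^>]*?(\srowspan\s*=\s*[0-9]+))?[^>]*(>)'

print(re.sub(lazy, handle_td, sample))     # <td>
print(re.sub(wrapped, handle_td, sample))  # <td rowspan=4>
```

With the variant, a td tag without rowspan still collapses to <td>, since the optional part is simply skipped once the search for rowspan fails.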

EvilBob22


Hey, at least you got something that works, which is the most important thing. I've also noticed some "extra" stuff that is not really needed: the square brackets around the `\\s` values do nothing, and the parens around the static text are not really needed either -- you generally don't need to dynamically capture static text (and you aren't treating a group of characters as one entity like `( – EvilBob22 Dec 06 '12 at 18:57
Thanks for the answer. The brackets are for grouping, so I can use the content of that group in the code. – SimonW Dec 09 '12 at 08:32
