Using Python 3.3, I'm trying to make some regular expression substitutions, without success.
I want to strip all attributes of the td tags except the rowspan attribute (example td's at the end).
Using the following command, I can substitute successfully when rowspan exists:
re.sub('(<td)[^>]*([\\s]rowspan[\\s]*=[\\s]*[0-9]*)[^>]*(>)', handle_td, file_contents)
where handle_td is:
def handle_td(matchobj):
    new_td = ''
    for curr_group in matchobj.groups(''):
        if curr_group != '':
            new_td += curr_group
    return new_td
But I also want to handle the rest of the td's, and this I didn't manage.
If I add ? after the second group, it changes the td tag to <td> and does not keep the rowspan attribute.
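A minimal reproduction of that behaviour, assuming the only change is the ? after the second group. The sample tag is a shortened version of one of the examples below, and the variable names are just for illustration. As far as I can tell, the first greedy [^>]* consumes the rowspan attribute before the optional group gets a chance to match:

```python
import re

# Shortened version of one of the example td's (style trimmed for readability).
tag = "<td width=307 rowspan=4 style='width:230.3pt'>"

# Same pattern as above, with `?` added after the second group.
optional = r"(<td)[^>]*([\s]rowspan[\s]*=[\s]*[0-9]*)?[^>]*(>)"

match = re.search(optional, tag)

# The first [^>]* is greedy and runs up to the closing '>', so the
# optional group matches nothing and comes back as '' via groups('').
groups = match.groups('')
result = ''.join(groups)

print(groups)
print(result)
```
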
What am I doing wrong? How can I fix this?
I don't mind running another command to handle the other td's, but I couldn't get that to work either.
<td width=307 valign=top style='width:230.3pt;border:solid windowtext 1.0pt; border-left:none;padding:0cm 5.4pt 0cm 5.4pt'>
<td width=307 rowspan=4 style='width:230.3pt;border:solid windowtext 1.0pt; border-top:none;padding:0cm 5.4pt 0cm 5.4pt'>
<td width=307 valign=top style='width:230.3pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt'>
This should produce:
<td>
<td rowspan=4>
<td>
I managed it this way (if you have a better way, feel free to add it):
import re

# Leave only specific attributes for td tags
def filter_td_attributes(matchobj):
    if matchobj.group(1) == "rowspan":
        return matchobj.group(1) + '=' + matchobj.group(2)
    return ''  # drop every other attribute

# Loop the attributes of the td tags
def handle_td(matchobj):
    new_td = re.sub("([a-zA-Z]+)[\\s]*=[\\s]*([a-zA-Z0-9:;.\\-'\\s]*)([\\s]|>)", filter_td_attributes, matchobj.group(0))
    new_td = re.sub("[\\s]*$", '', new_td)
    new_td = new_td + ">"  # close the td tag
    return new_td

file_contents = re.sub('[\\s]*</p>[\\s]*</td>', '</td>', file_contents)
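A possible alternative, sketched under the assumption that rowspan values are always bare digits (as in the examples above): match each whole td tag first, then look for rowspan inside that tag only. The names TD_TAG, ROWSPAN and strip_td are made up for this sketch, and the sample styles are shortened. Like any regex over HTML, it will probably break on unusual markup, for example a style value that itself contains the text rowspan=.

```python
import re

# Matches one whole opening td tag, attributes included.
TD_TAG = re.compile(r"<td\b[^>]*>")
# Finds an unquoted numeric rowspan inside a single tag
# (assumption: the value is bare digits, as in the examples).
ROWSPAN = re.compile(r"\browspan\s*=\s*([0-9]+)")

def strip_td(matchobj):
    """Rebuild one td tag, keeping only rowspan when it is present."""
    rowspan = ROWSPAN.search(matchobj.group(0))
    if rowspan:
        return "<td rowspan=" + rowspan.group(1) + ">"
    return "<td>"

# The three example td's from the question, with shortened styles.
sample = "\n".join([
    "<td width=307 valign=top style='width:230.3pt;border-left:none'>",
    "<td width=307 rowspan=4 style='width:230.3pt;border-top:none'>",
    "<td width=307 valign=top style='width:230.3pt; padding:0cm 5.4pt'>",
])

cleaned = TD_TAG.sub(strip_td, sample)
print(cleaned)
```

Since the outer pattern no longer has to capture rowspan itself, the greedy [^>]* cannot swallow it, and td's with and without rowspan go through the same single re.sub call.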