I have a problem with a regex in Python. I have some HTML pages that contain useful information for me. At the time the pages were saved, the encoding charset was some kind of ISO..., and all the typical German letters were stored encoded, e.g. "Fr%C3%BCchte" for Früchte, and so on. The HTML is really badly structured, so the only reasonable way to scrape it is with a regex.
I have this regex in Python:
re.compile('<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')
Unfortunately it is not exactly what I want, because the encoded words are only partially captured, e.g. the result is:
[('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol'),
('showSubGroups', '160400', "', 'Rumtopf"),
('showSubGroups', '160300', "', 'Spirituosen (Bio)"),
('showSubGroups', '160200', "', 'Spirituosen zur Verarbeitung in der Confiserie"),
('showSubGroups', '160100', "', 'Spirituosen, allgemein")]
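For reference, a minimal sketch that reproduces this result on two of the links from the HTML further down. The names `pattern`, `html`, and `result` are only for illustration, and the pattern is written as a raw string, which should not change what it matches:

```python
import re

# Same pattern as in the question, as a raw string so that \s, \w and \d
# reach the regex engine without Python touching the backslashes.
pattern = re.compile(r'<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')

# Two of the links from the saved page (assumed to be held in a string).
html = '''<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>'''

result = pattern.findall(html)

# In the first link the second group runs on into the label and stops at
# the digit in "%C3"; in the second link the quote and comma end up in the
# third group instead.
for row in result:
    print(row)
```

The greedy `.+` between the two `\d+` seems to be what stretches the second group up to the last digit on the line.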
Maybe I'm tired, but I can't see where the error is.
Here is the HTML:
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr> <tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr> <tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr> <tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr> <tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr> </tbody></table>
</td>
</tr>
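One possible direction, given as a hedged sketch rather than a definitive fix: match each of the three quoted arguments explicitly with `[^']*` and `\d+` instead of `.+` and `.*`, then decode the percent-encoded label. This assumes Python 3 (for `urllib.parse.unquote`), that the labels never contain a single quote, and that `%C3%BC` is a UTF-8 percent-encoding; the names `pattern`, `html`, and `rows` are hypothetical.

```python
import re
from urllib.parse import unquote

# Each argument is captured on its own: text up to the next single quote,
# then digits only, then text up to the next single quote again.
pattern = re.compile(
    r"""<a\s+href="javascript:\w+\('([^']*)',\s*'(\d+)',\s*'([^']*)'\)">"""
)

# Three of the links from the page above (assumed to be held in a string).
html = '''<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>'''

# unquote() decodes %XX escapes as UTF-8 by default, so "%C3%BC" becomes "ü".
rows = [
    (func, code, unquote(label))
    for func, code, label in pattern.findall(html)
]

for row in rows:
    print(row)
```

Because none of the groups can cross a quote, the digits inside "%C3" no longer affect where the second group ends.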