Scraping HTML with Python regex

Question

I have a problem with a regex in Python. I have some HTML pages that contain information useful to me. When the pages were saved, the charset was some kind of ISO encoding, which stored all the typical German letters in encoded form, e.g. "Fr%C3%BCchte" for Früchte, and so on. The HTML is so badly structured that the only reasonable way to scrape it is with a regex.

I have this regex in Python:

re.compile('<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')

Unfortunately it is not quite what I want, because the encoded words are only partially captured, e.g. the result is:

[('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol'),
 ('showSubGroups', '160400', "', 'Rumtopf"),
 ('showSubGroups', '160300', "', 'Spirituosen (Bio)"),
 ('showSubGroups', '160200', "', 'Spirituosen zur Verarbeitung in der Confiserie"),
 ('showSubGroups', '160100', "', 'Spirituosen, allgemein")]

Maybe I'm tired, but I can't see where the error is.

Here is the HTML:

<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>                </tbody></table>
            </td>
        </tr>

The canonical response to this sort of question is http://stackoverflow.com/a/1732454/735204. I would recommend against using regex to parse HTML; consider instead a library like BeautifulSoup or lxml, which supports XPath queries for HTML parsing. — Emmett Butler, Aug 29 '12 at 22:17
Hmm, that canonical response seems overly dramatic. And it may even be correct that you can't parse HTML with regex. But you _can_ extract information from it. Which is kind of the point here. — Roland Smith, Aug 29 '12 at 22:37
@RolandSmith sure you *can* (for a limited subset, at least), the point is there are easier and better ways. — Hamish, Aug 29 '12 at 23:31

Roland Smith · Answer 1 · 2012-08-29T22:51:57.407

Try this:

f = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

With your text as input, it gives the following:

In [7]: f.findall(txt)
Out[7]:  [('160500', 'Fr%C3%BCchte in Alkohol'), ('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)'), ('160200', 'Spirituosen zur Verarbeitung in der Confiserie'), ('160100', 'Spirituosen, allgemein')]
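
For reference, here is a self-contained sketch of this kind of extraction that can be run as a script. It is a variant that also captures the first argument of the call. It assumes Python 3 and that every sendForm argument is single-quoted and contains no single quote itself; the names txt, pattern and rows and the two-line sample are only illustrative.

```python
import re

# Two lines taken from the HTML in the question.
txt = """
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
"""

# [^']* cannot run past a closing quote, so commas and parentheses
# inside an argument do not shift the group boundaries.
pattern = re.compile(r"sendForm\('([^']*)', '([^']*)', '([^']*)'\)")

rows = pattern.findall(txt)
for action, group_id, label in rows:
    print(action, group_id, label)
```

Anchoring each group on the quotes is what keeps 'Spirituosen, allgemein' in one piece; a greedy .* between the commas would split it at the inner comma.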

As far as decoding the %C3%BC (for 'ü') goes, it is the UTF-8 encoding of the character (the bytes 0xC3 0xBC) written in URL-style percent notation, so it decodes if you replace each '%' with '\x':

In [39]: '\xC3\xBC'.decode('utf-8')
Out[39]: u'\xfc'

U+00FC is the Unicode code point for ü.
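
The standard library can do this decoding in one step, since '%C3%BC' has the same form as the escapes used in URLs. A minimal sketch, assuming Python 3, where urllib.parse.unquote decodes the escaped bytes as UTF-8 by default (in Python 2 the function lives in urllib and returns a byte string that still has to be decoded):

```python
from urllib.parse import unquote

encoded = "Fr%C3%BCchte in Alkohol"

# unquote() turns each %XX escape into a byte and decodes the
# resulting byte sequence as UTF-8 unless another encoding is given.
decoded = unquote(encoded)
print(decoded)
```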

score 0 · Answer 2 · answered Aug 29 '12 at 22:24

Beautiful Soup is a great library for parsing HTML.

Once you have extracted the hrefs from the HTML, running a regex over them should be pretty easy.
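
A rough sketch of that two-step idea follows. Beautiful Soup is a third-party package, so to stay dependency-free this example substitutes the standard library's html.parser for the href extraction; that is a swap, not what the answer names. It assumes Python 3, and HrefCollector, the sample line and the regex are illustrative. With Beautiful Soup the collection step would typically be a find_all('a') call instead of the class below.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# One line taken from the HTML in the question.
html = """<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>"""

collector = HrefCollector()
collector.feed(html)

# With the href isolated, the regex only has to describe the JavaScript call.
call = re.compile(r"sendForm\('([^']*)', '([^']*)', '([^']*)'\)")

results = []
for href in collector.hrefs:
    match = call.search(href)
    if match:
        results.append(match.groups())

print(results)
```

Splitting the work this way means the regex never has to cope with the surrounding table markup, only with the short href string.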

varunl
