
I have a problem with a regex in Python. I have some HTML pages that contain information useful to me. When the pages were saved, the charset was some kind of ISO encoding, so all the typically German letters were saved percent-encoded, e.g. "Fr%C3%BCchte" for Früchte, and so on. The HTML is so badly structured that the only reasonable way to scrape it is with regex.

I have this regex in Python:

re.compile('<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')

Unfortunately it does not do exactly what I want, because the encoded words are captured only partially; e.g. the result is:

[('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol'),
 ('showSubGroups', '160400', "', 'Rumtopf"),
 ('showSubGroups', '160300', "', 'Spirituosen (Bio)"),
 ('showSubGroups', '160200', "', 'Spirituosen zur Verarbeitung in der Confiserie"),
 ('showSubGroups', '160100', "', 'Spirituosen, allgemein")]

Maybe I'm tired, but I can't see where the error is.

Here is the HTML:

<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>                </tbody></table>
            </td>
        </tr>
xaverras
    The canonical response to this sort of question: http://stackoverflow.com/a/1732454/735204. I would recommend against using regex to parse HTML; consider instead a library like BeautifulSoup or lxml (the latter supports XPath) for HTML parsing. – Emmett Butler Aug 29 '12 at 22:17
    Hmm, that canonical response seems overly dramatic. And it may even be correct that you can't parse HTML with regex. But you _can_ extract information from it. Which is kind of the point here. – Roland Smith Aug 29 '12 at 22:37
    @RolandSmith sure you *can* (for a limited subset, at least); the point is that there are easier and better ways. – Hamish Aug 29 '12 at 23:31

2 Answers


Try this:

f = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

With your text as input, it gives the following:

In [7]: f.findall(txt)
Out[7]:  [('160500', 'Fr%C3%BCchte in Alkohol'), ('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)'), ('160200', 'Spirituosen zur Verarbeitung in der Confiserie'), ('160100', 'Spirituosen, allgemein')]

As for decoding the %C3%BC (for 'ü'): this is percent-encoding (URL encoding) of the two UTF-8 bytes 0xC3 0xBC. You can see that it decodes if you replace each '%' with '\x' (Python 2 shown):

In [39]: '\xC3\xBC'.decode('utf-8')
Out[39]: u'\xfc'

U+00FC is the Unicode code point for ü.
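Putting the two steps together, here is a minimal Python 3 sketch, assuming the links always have the form sendForm('...', '<digits>', '<name>') as in the question. The names SEND_FORM and extract_groups are made up for illustration; urllib.parse.unquote is the standard-library function that undoes percent-encoding (it assumes UTF-8 by default).

```python
import re
from urllib.parse import unquote

# Sample input copied from the question's HTML (two rows only).
html = """
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
"""

# Anchor each argument on its single quotes so that commas and
# parentheses inside a name cannot end a capture early.
SEND_FORM = re.compile(r"sendForm\('[^']*',\s*'(\d+)',\s*'([^']*)'\)")


def extract_groups(text):
    """Return (group id, decoded name) pairs for every sendForm call."""
    return [(gid, unquote(name)) for gid, name in SEND_FORM.findall(text)]


groups = extract_groups(html)
print(groups)
```

Matching on [^'] rather than .* is a deliberate choice here: names such as 'Spirituosen, allgemein' or 'Spirituosen (Bio)' contain the same characters that separate and close the arguments.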

Roland Smith

Beautiful Soup is a great library for parsing HTML.

Once you have extracted the hrefs from the HTML, running a regex over them should be pretty easy.
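A rough sketch of that two-step idea follows. To keep it free of third-party dependencies it uses the standard library's html.parser in place of Beautiful Soup, so treat it as a stand-in rather than Beautiful Soup code. HrefCollector, ARGS, and subgroups are hypothetical names, and the sample is one row trimmed from the question's HTML.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Applied to a single href, the regex no longer has to skip over markup.
ARGS = re.compile(r"sendForm\('[^']*',\s*'(\d+)',\s*'([^']*)'\)")


def subgroups(html):
    """Return (group id, decoded name) pairs found in <a href=...> values."""
    collector = HrefCollector()
    collector.feed(html)
    results = []
    for href in collector.hrefs:
        match = ARGS.search(href)
        if match:
            results.append((match.group(1), unquote(match.group(2))))
    return results


# One row trimmed from the question's HTML.
sample = """<tr valign="top">
 <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
</tr>"""

print(subgroups(sample))
```

Separating "find the links" from "pick apart one link" keeps the regex short, because it only ever sees the contents of a single href.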

varunl