Extract html data using regular expressions

Question

I have an html page that looks like this

<tr>
    <td align=left>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
            <img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
        </a> 
        <a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
    </td>
</tr>
<tr align=right>
    <td align=left>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
            <img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
        </a> 
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html" title="b">aa</a>
    </td>
</tr>

I need to get the text

history/2c0b65635b3ac68a4d53b89521216d26.html marketing/3c0a65635b2bc68b5c43b88421306c37.html

I wrote a script in python that uses regular expressions

import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))

where s's value is the html page above. However when I use this script I get "None". Where did I go wrong?

Instead of regex try using BeautifulSoup. – f.rodrigues Dec 27 '14 at 06:11 — f.rodrigues, Dec 27 '14 at 06:11

score 3 · Accepted Answer · edited May 23 '17 at 10:27

Don't use regex for parsing HTML content.

Use a specialized tool - an HTML Parser.

Example (using BeautifulSoup):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""Your HTML here"""

soup = BeautifulSoup(data)
for link in soup.select('td a[href]'):
    print link['href']

Prints:

history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html

Or, if you want to get the href values that follow a pattern, use:

import re

for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print link['href']

where r'\w+/\w{32}\.html' is a regular expression that would be applied to an href attribute of every a tag found. It would match one or more alphanumeric characters (\w+), followed by a slash, followed by exactly 32 alphanumeric characters (\w{32}), followed by a dot (\.- needs to be escaped), followed by html.

DEMO.

score 2 · Answer 2 · answered Dec 27 '14 at 06:14

You can also write something like

>>> soup = BeautifulSoup(html) #html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print a['href']
... 
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html

Extract html data using regular expressions

2 Answers2