
I have the following text produced by my code in Python:

<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>

Could you advise me how to extract the data from within the <td>? My idea is to put it in a CSV file with the following format: some link, some data 1, some data 2, some data 3.

I expect it might be hard to do without regular expressions, but truthfully I still struggle with regular expressions.

I used my code more or less in the following manner:

tabulka = subpage.find("table")

for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    print col[0]

Ideally I would get the content of each td into some array. The HTML above is the output of my Python code.

  • You can use BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html to extract the information rather than using regular expressions. – jassinm Jun 15 '13 at 18:32
  • Don't use regular expressions to parse non-trivial HTML. – Cairnarvon Jun 15 '13 at 19:35
  • I can use BeautifulSoup to extract data from the td, but what do I do next to work on the data inside? – Lormitto Jun 15 '13 at 19:45

3 Answers


Get BeautifulSoup and just use it. It's great.

$> easy_install pip
$> pip install BeautifulSoup
$> python
>>> from BeautifulSoup import BeautifulSoup as BS
>>> import urllib2
>>> html = urllib2.urlopen(your_site_here)
>>> soup = BS(html)
>>> elem = soup.findAll('a', {'title': 'title here'})
>>> elem[0].text
yurisich
  • findAll() returns a list or None, and lists and None don't have a .text() method, so your last line is always an error. – 7stud Jun 15 '13 at 18:39
  • As 7stud stated above, it is a problem indeed. – Lormitto Jun 15 '13 at 19:03
  • As of today, those modules are replaced by beautifulsoup4 (https://pypi.org/project/beautifulsoup4/) and urllib.request (https://docs.python.org/3/library/urllib.request.html). Ref: https://stackoverflow.com/questions/2792650/import-error-no-module-name-urllib2 – Krishna Aug 23 '20 at 17:10
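
For readers landing here now, a minimal Python 3 sketch of the same idea using the replacement modules named in the comment above; your_site_here remains a placeholder URL, as in the answer:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(your_site_here)  # your_site_here is a placeholder, as above
soup = BeautifulSoup(html, "html.parser")
elems = soup.find_all("a", {"title": "title here"})
print(elems[0].text)  # prints "some link", assuming at least one match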

You shouldn't use regexes on HTML. You should use BeautifulSoup or lxml. Here are some examples using BeautifulSoup:

Your td tags actually look like this:

<td>newline
<a>some link</a>newline
<br />newline
some data 1<br />newline
some data 2<br />newline
some data 3</td>

So td.text looks like this:

<newline>some link<newline><newline>some data 1<newline>some data 2<newline>some data 3

You can see that each string is separated by at least one newline, which lets you separate out each string.

from bs4 import BeautifulSoup as bs
import re

html = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>"""

soup = bs(html, "html.parser")  # name a parser explicitly to avoid bs4's guessed-parser warning
tds = soup.find_all('td')
csv_data = []

for td in tds:
    inner_text = td.text
    strings = inner_text.split("\n")

    csv_data.extend([string for string in strings if string])

print(",".join(csv_data))

--output:--
some link,some data 1,some data 2,some data 3
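
Since the stated goal is a CSV file, here is a minimal sketch (not part of the original answer) that writes csv_data out with Python's csv module instead of joining on commas by hand; output.csv is an assumed filename, and csv.writer also quotes any field that itself contains a comma:

import csv

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(csv_data)  # one row: some link,some data 1,some data 2,some data 3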

Or more concisely:

for td in tds:
    print(re.sub("\n+", ",", td.text.lstrip()))

--output:--
some link,some data 1,some data 2,some data 3

But that solution is brittle because it won't work if your HTML looks like this:

<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>

Now td.text looks like this:

<newline>some link<newline>some data 1some data 2some data 3

And there isn't a way to figure out where some of the strings begin and end. But that just means you can't use td.text; there are still other ways to identify each string:

1)

from bs4 import BeautifulSoup as bs
import re

html = """<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />some data 1<br />some data 2<br />some data 3</td>"""

soup = bs(html, "html.parser")  # name a parser explicitly to avoid bs4's guessed-parser warning
tds = soup.find_all('td')
csv_data = []

for td in tds:
    a_tags = td.find_all('a')

    for a_tag in a_tags:
        csv_data.append(a_tag.text)
        br_tags = a_tag.findNextSiblings('br')

        for br in br_tags:
            csv_data.append(br.next.strip())  #get the element after the <br> tag

csv_str = ",".join(csv_data)
print(csv_str)

--output:--
some link,some data 1,some data 2,some data 3

2)

for td in tds:
    a_tag = td.find('a')
    if a_tag: csv_data.append(a_tag.text)

    for string in a_tag.findNextSiblings(text=True):  #find only text nodes
        string = string.strip()
        if string: csv_data.append(string)

csv_str = ",".join(csv_data)
print(csv_str)

--output:--
some link,some data 1,some data 2,some data 3

3)

for td in tds:
    a_tag = td.find('a')
    if a_tag: csv_data.append(a_tag.text)

    text_strings = a_tag.findNextSiblings(text=re.compile(r'\S+'))  #find only non-whitespace text nodes
    csv_data.extend(text_strings)

csv_str = ",".join(csv_data)
print(csv_str)

--output:--
some link,some data 1,some data 2,some data 3
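
One more option, assuming a reasonably recent bs4: the stripped_strings generator yields each text node with surrounding whitespace removed and skips whitespace-only nodes, so it handles both HTML variants above without walking siblings:

for td in tds:
    csv_data.extend(td.stripped_strings)  # every text node, stripped; empties skipped

print(",".join(csv_data))

--output:--
some link,some data 1,some data 2,some data 3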
7stud

I've never used BeautifulSoup, but I would bet that it is 'HTML-tag-aware' and can handle 'filler' space. But since HTML markup files are structured (and usually generated by a web design program), you can also try a direct approach using Python's .split() method. Incidentally, I recently used this approach to parse a real-world URL/HTML to do something very similar to what the OP wanted.

Although the OP wanted to pull only one field from the <a> tag, below we pull the 'usual two' fields.

CODE:

#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Extracting data from HTML using split()
# Link: https://stackoverflow.com/questions/17126686/extracting-data-from-html-with-python
#--------*---------*---------*---------*---------*---------*---------*---------*

import sys

page     = """blah blah blah
<td>
<a href="http://www.link1tosomewhere.net" title="title1 here">some link1</a>
<br />
some data1 1<br />
some data1 2<br />
some data1 3</td>
mlah mlah mlah
<td>
<a href="http://www.link2tosomewhere.net" title="title2 here">some link2</a>
<br />
some data2 1<br />
some data2 2<br />
some data2 3</td>
flah flah flah
"""

#--------*---------*---------*---------*---------*---------*---------*---------#
while 1:#                          M A I N L I N E                             #
#--------*---------*---------*---------*---------*---------*---------*---------#
    page = page.replace('\n','')   # remove \n from test html page
    csv = ''
    li = page.split('<td><a ')
    for i in range(0, len(li)):
        if li[i][0:6] == 'href="':
            s = li[i].split('</td>')[0]
#                                  # li2 ready for csv            
            li2 = s.split('<br />')
#                                  # create csv file
            for j in range(0, len(li2)):
#                                  # get two fields from li2[0]               
                if j == 0:
                    li3 = li2[0].split('"')
                    csv = csv + li3[1] + ','
                    li4 = li3[4].split('<')
                    csv = csv + li4[0][1:] + ','
#                                  # no comma on last field - \n instead
                elif j == len(li2) - 1:
                    csv = csv + li2[j] + '\n'
#                                  # just write out middle stuff                    
                else:
                    csv = csv + li2[j] + ','
    print(csv)                    
    sys.exit()

OUTPUT:

http://www.link1tosomewhere.net,some link1,some data1 1,some data1 2,some data1 3
http://www.link2tosomewhere.net,some link2,some data2 1,some data2 2,some data2 3
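
For comparison, a minimal BeautifulSoup sketch (not part of this answer) that pulls the same two fields plus the data, run against the page string above; find_next_siblings(string=True) is the newer spelling of the text=True argument used in the earlier answer:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "html.parser")  # page as defined above
for td in soup.find_all("td"):
    a = td.find("a")
    fields = [a["href"], a.text]  # the 'usual two' fields from the <a> tag
    fields += [s.strip() for s in a.find_next_siblings(string=True) if s.strip()]
    print(",".join(fields))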
CopyPasteIt