Accessing parsed HTML data in Python using lists

Question

I have parsed an HTML document in Python and am storing the contents of the body tag in a list. Below is the code:

import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)
print data

The output of the code above is:

        6          3
    12603        235          1
    37210        363          3
    64618        348          2
        4          4
    80073        560          1
    80560        504          1
    80875        807          1
    80917        636          1

I want to store each line in its own list. I need help doing this, as I am new to Python. Thanks, ghbhatt.

score 3 · Answer 1 · edited May 23 '17 at 12:03

Don't use regex to parse HTML; see the question "RegEx match open tags except XHTML self-contained tags" for why.

Instead, there are a number of great parsers in Python:

http://www.crummy.com/software/BeautifulSoup/

http://lxml.de/

Use one of those and, in general, getting a list of the contents will just be part of what the library does.
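
As a rough illustration of the parser-based approach, here is a minimal sketch written for Python 3. It uses html.parser from the standard library rather than BeautifulSoup or lxml, so that it runs without installing anything. The class name BodyTextParser and the sample_html string are made up for the example; the real page's markup is not shown in the question, so the structure here (numbers between BODY and HR) is only an assumption based on the regex used there.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text inside <body>, stopping at the first <hr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <BODY> arrives as "body".
        if tag == "body":
            self.in_body = True
        elif tag == "hr":
            self.done = True

    def handle_data(self, data):
        # Keep only text seen after <body> and before the first <hr>.
        if self.in_body and not self.done:
            self.chunks.append(data)


# Assumed sample markup, not the real page.
sample_html = """<HTML><HEAD><TITLE>demo</TITLE></HEAD><BODY>
        6          3
    12603        235          1
<HR>
footer text
</BODY></HTML>"""

parser = BodyTextParser()
parser.feed(sample_html)
parser.close()

body_text = "".join(parser.chunks)
# One list per non-blank line, split on whitespace.
rows = [line.split() for line in body_text.splitlines() if line.strip()]
print(rows)
```

With BeautifulSoup or lxml the extraction step would be shorter, but the final line-splitting step would look much the same.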

edited May 23 '17 at 12:03

Community

answered Feb 03 '12 at 10:57

Glenn

score 2 · Accepted Answer · 2012-02-03T11:05:48.633

#!/bin/python

data = """6          3
    12603        235          1
    37210        363          3
    64618        348          2
        4          4
    80073        560          1
    80560        504          1
    80875        807          1
    80917        636          1"""

lists = [line.split() for line in data.split("\n")]

print lists

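Edit: data.splitlines() is probably more portable than data.split("\n"): it also recognizes "\r\n" line endings and does not return a trailing empty string when the text ends with a newline.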
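
To make the difference concrete, here is a small sketch (Python 3 syntax) comparing the two calls on text with Windows-style line endings. The text value is an invented sample, not the real page content.

```python
# Invented sample with "\r\n" line endings and a trailing newline.
text = "6 3\r\n12603 235 1\r\n"

# split("\n") leaves the "\r" on each line and adds a trailing empty string.
raw_split = text.split("\n")

# splitlines() handles "\r\n" and produces no trailing empty entry.
raw_splitlines = text.splitlines()

# line.split() strips the stray "\r", but the empty string becomes an empty list.
rows_split = [line.split() for line in raw_split]
rows_splitlines = [line.split() for line in raw_splitlines]

print(rows_split)       # ends with an empty list
print(rows_splitlines)  # no empty list
```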

score 2 · Answer 3 · answered Feb 03 '12 at 10:54

l = []
for line in data.splitlines():
    l.append(line.split())

or

l = [line.split() for line in data.splitlines()]

l is now:

[['6', '3'],
 ['12603', '235', '1'],
 ['37210', '363', '3'],
 ['64618', '348', '2'],
 ['4', '4'],
 ['80073', '560', '1'],
 ['80560', '504', '1'],
 ['80875', '807', '1'],
 ['80917', '636', '1']]

This stores the data as a list of lists of strings. If you know the data contains only integers, you can do:

l = []
for line in data.splitlines():
    l.append([int(a) for a in line.split()])

or

l = []
for line in data.splitlines():
    l.append(map(int, line.split()))

or

l = [map(int, line.split()) for line in data.splitlines()]

which creates:

[[6, 3],
 [12603, 235, 1],
 [37210, 363, 3],
 [64618, 348, 2],
 [4, 4],
 [80073, 560, 1],
 [80560, 504, 1],
 [80875, 807, 1],
 [80917, 636, 1]]
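
One caveat if you run this on Python 3: map returns a lazy iterator there instead of a list, so the map-based variants would give a list of map objects. A minimal sketch of the Python 3 form, using a shortened copy of the data from the question:

```python
# Shortened copy of the sample data from the question.
data = """        6          3
    12603        235          1
    37210        363          3"""

# On Python 3, wrap map() in list() to get real lists of ints.
rows = [list(map(int, line.split())) for line in data.splitlines()]
print(rows)
```

The list-comprehension variant ([int(a) for a in line.split()]) behaves the same on both versions.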

score 1 · Answer 4 · answered Feb 03 '12 at 10:54

Use the split method to split the string into lines, and then to split each line into columns:

import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)

list_data = []
data_lines = data.split("\n")  # Split the string into a list of lines
for line in data_lines:
    row = line.split()  # Split the line into whitespace-separated fields
    list_data.append(row)

for row in list_data:
    print row
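
For readers on Python 3, a possible port might look like the sketch below. The helper names fetch and extract_rows are invented for the example, the UTF-8 decoding is an assumption about the page, and re.search is used in place of the leading ".*?" with match. It also skips blank lines, which would otherwise show up as empty lists. Only the parsing step is exercised here, on a made-up sample string, so the sketch runs without network access.

```python
import re
from urllib.request import urlopen

# Same idea as the regex in the question: capture everything between <BODY> and <HR>.
BODY_RE = re.compile(r'<BODY>(.*?)<HR>', re.DOTALL)


def fetch(url):
    """Download a page and decode it (hypothetical helper; UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_rows(html):
    """Return one list of fields per non-blank line of the captured body text."""
    match = BODY_RE.search(html)
    if match is None:
        return []  # Pattern not found: return no rows instead of raising.
    return [line.split() for line in match.group(1).splitlines() if line.strip()]


# Made-up sample standing in for the downloaded page.
sample = "<HTML><BODY>\n        6          3\n    12603        235          1\n<HR></BODY></HTML>"
for row in extract_rows(sample):
    print(row)
```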

score 0 · Answer 5 · answered Feb 03 '12 at 10:58

I'm not sure if this is what you want:

[re.findall(r'\d+', line) for line in data.split('\n')]
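
Note that this behaves differently from line.split() when a line contains anything other than plain digits. A small sketch (Python 3 syntax) on invented input lines, which are not taken from the real page:

```python
import re

# Invented lines to show where the two approaches diverge.
lines = ["12,603 abc 235", "-4 4"]

# findall keeps only runs of digits: separators, letters and signs are dropped.
by_regex = [re.findall(r'\d+', line) for line in lines]

# split keeps every whitespace-separated token unchanged.
by_split = [line.split() for line in lines]

print(by_regex)
print(by_split)
```

For the all-digit data in the question, both give the same lists of strings.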

answered Feb 03 '12 at 10:58

ptitpoulpe
