I have parsed an HTML document in Python and I am storing the contents of the body tag in a string. Below is the code:

import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)
print data

The output of the code above is:

        6          3
    12603        235          1
    37210        363          3
    64618        348          2
        4          4
    80073        560          1
    80560        504          1
    80875        807          1
    80917        636          1

I want to store each line in its own list. I need help doing this, as I am new to Python. Thanks, ghbhatt.

gsb

5 Answers

3

Don't use regex to parse HTML; see "RegEx match open tags except XHTML self-contained tags".

Instead, there are a number of great parsers in Python:

http://www.crummy.com/software/BeautifulSoup/

http://lxml.de/

Use one of those and, in general, getting a list of the contents will just be part of what the library does.
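To give a rough idea of what parser-based extraction looks like, here is a minimal sketch. Note that it swaps in the standard library's html.parser instead of BeautifulSoup or lxml, purely so it runs without installing anything. The class name BodyTextParser and the sample_html string are made up for illustration; the real page may be structured differently. The code is written for Python 3.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text found inside <body>, stopping at the first <hr>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> and <body> both match.
        if tag == "body":
            self.in_body = True
        elif tag == "hr":
            self.in_body = False

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


# Hypothetical stand-in for the downloaded page.
sample_html = "<HTML><BODY>\n    6   3\n 12603  235  1\n<HR>footer</BODY></HTML>"

parser = BodyTextParser()
parser.feed(sample_html)
parser.close()

body_text = "".join(parser.chunks)
# One list per non-blank line, split on whitespace.
rows = [line.split() for line in body_text.splitlines() if line.strip()]
print(rows)
```

With BeautifulSoup or lxml the same idea should take fewer lines, but the exact calls depend on the library and on the page's markup.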

Glenn
2
#!/usr/bin/env python

data = """6          3
    12603        235          1
    37210        363          3
    64618        348          2
        4          4
    80073        560          1
    80560        504          1
    80875        807          1
    80917        636          1"""

lists = [line.split() for line in data.split("\n")]

print lists

Edit: data.splitlines() is probably more portable than data.split("\n"), since it also recognizes \r\n and \r line endings.
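A small sketch of where the two differ, assuming the text arrives with Windows-style line endings and a trailing newline (the sample string below is invented for the demonstration, and the code is Python 3):

```python
# Invented sample with \r\n line endings and a trailing newline.
data = "6 3\r\n12603 235 1\r\n"

# split("\n") leaves an empty string after the final newline,
# which then turns into an empty list.
with_split = [line.split() for line in data.split("\n")]

# splitlines() treats \r\n as one line break and adds no trailing empty entry.
with_splitlines = [line.split() for line in data.splitlines()]

print(with_split)       # [['6', '3'], ['12603', '235', '1'], []]
print(with_splitlines)  # [['6', '3'], ['12603', '235', '1']]
```

The stray \r is harmless here only because line.split() with no arguments treats it as whitespace; the extra empty list at the end is the visible difference.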

2
l = []
for line in data.splitlines():
    l.append(line.split())

or

l = [line.split() for line in data.splitlines()]

l is now:

[['6', '3'],
 ['12603', '235', '1'],
 ['37210', '363', '3'],
 ['64618', '348', '2'],
 ['4', '4'],
 ['80073', '560', '1'],
 ['80560', '504', '1'],
 ['80875', '807', '1'],
 ['80917', '636', '1']]

This stores the data as a list of lists of strings. If you know the data contains only integers, you can do:

l = []
for line in data.splitlines():
    l.append([int(a) for a in line.split()])

or

l = []
for line in data.splitlines():
    l.append(map(int, line.split()))

or

l = [map(int, line.split()) for line in data.splitlines()]

which creates:

[[6, 3],
 [12603, 235, 1],
 [37210, 363, 3],
 [64618, 348, 2],
 [4, 4],
 [80073, 560, 1],
 [80560, 504, 1],
 [80875, 807, 1],
 [80917, 636, 1]]
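One caveat if you are on Python 3: map() returns an iterator there rather than a list, so the map-based variants above would give a list of map objects instead of the output shown. A minimal sketch of two ways that should work on Python 3, using a shortened copy of the sample data:

```python
# Shortened copy of the sample data from the question.
data = """        6          3
    12603        235          1
        4          4"""

# A nested list comprehension behaves the same on Python 2 and 3.
rows = [[int(tok) for tok in line.split()] for line in data.splitlines()]

# With map(), wrap the result in list() on Python 3.
rows_via_map = [list(map(int, line.split())) for line in data.splitlines()]

print(rows)  # [[6, 3], [12603, 235, 1], [4, 4]]
```

Both raise ValueError if a token is not a valid integer, so they only fit data that is known to be all integers.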
eumiro
1

Use the split method to split the string into lines, and then each line into columns:

import urllib, re
text = urllib.urlopen("http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&g=p&v=algorithm&v=javed").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)

list_data = []
data_lines = data.split("\n")  # Split the string into a list of lines
for line in data_lines:
    row = line.split()  # Split the line into its whitespace-separated fields
    list_data.append(row)

for row in list_data:
    print row
Mariusz Jamro
0

I'm not sure if this is what you want:

[re.findall(r'\d+', line) for line in data.split('\n')]
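For comparison with plain line.split(), here is a small sketch of how the regex version behaves when a line carries something other than numbers. The leftover <BR> tag in the sample is hypothetical; I have not checked whether the real page contains one. Written for Python 3.

```python
import re

# Invented sample: the second line carries a leftover tag.
data = "    6   3\n 12603  235  1 <BR>\n"

# findall keeps only the runs of digits on each line.
digit_rows = [re.findall(r"\d+", line) for line in data.splitlines()]

# split() keeps every whitespace-separated token, tag included.
split_rows = [line.split() for line in data.splitlines()]

print(digit_rows)  # [['6', '3'], ['12603', '235', '1']]
print(split_rows)  # [['6', '3'], ['12603', '235', '1', '<BR>']]
```

Keep in mind that \d+ also drops minus signs and cuts decimals at the point, so it only suits non-negative integers.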
ptitpoulpe