Finding the first table in an HTML file in Python

Question

I am trying to find the first table in an HTML file and copy the whole table into a string s:

f = open('page.html' , 'r')
s = ""
for line in f.readlines():
  line = line.strip()
  if line.find('<table'):
    s += line
  if line.find('</table>'):
    break
print s

This code is not working. How do I solve it using only the Python standard library?

`BeautifulSoup` please: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — sshashank124, Apr 10 '14 at 08:41
Your first line.find is missing a > after in s. Also, what if you get an HTML document with everything on one line (no line breaks)? :) — simon, Apr 10 '14 at 09:19
@gurka It's not a bug, it's a feature :) It's needed for tags like ``. — Sufian Latif, Apr 10 '14 at 09:22
@VeilEclipse did the solutions help you? If so, how about accepting one? If not, please say so too. — salmanwahed, Apr 10 '14 at 22:09

score 0 · Answer 1 · edited May 23 '17 at 10:33


Try using XPath; see this SO question: Parse HTML via XPath
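One way to try the XPath idea with only the standard library is `xml.etree.ElementTree`, which supports a limited XPath subset. The sketch below is Python 3 and rests on strong assumptions: the input must be well-formed XHTML (ordinary HTML often is not and will raise a parse error), and the elements must not be in a default namespace. The function name `first_table_xpath` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def first_table_xpath(xhtml):
    """Return the first <table> element as a string, or None if there is none.

    Assumes xhtml is well-formed XML without a default namespace.
    """
    root = ET.fromstring(xhtml)
    # './/table' selects <table> descendants in document order;
    # find() returns the first one (it does not test the root itself).
    table = root.find('.//table')
    if table is None:
        return None
    table.tail = None  # drop any text that follows the closing tag
    return ET.tostring(table, encoding='unicode')
```

Nested tables come back intact here because the parser builds a tree, so no tag matching by hand is needed.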


answered Apr 10 '14 at 08:48

mpcabd


Sufian Latif · Answer 2 · 2014-04-10T09:23:32.420

If you have to stick to the standard library, you need the contents between the first <table> and the </table> that closes it. Because tables can be nested, that is not necessarily the first </table> in the file.

To find it, you'll need a stack. Read the file from the beginning. Whenever you encounter a <table>, push its position onto the stack, and whenever you see a </table>, pop one from the stack. This matches each </table> with its corresponding <table>.

Watch for the </table> whose pop leaves the stack empty: it must close the first <table>, so store its position and stop.

Now you have the positions of the first <table> and its matching </table>, so you can copy everything between them into a string.
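The stack procedure described above can be sketched roughly as follows, in Python 3 with only the standard library. The names `first_table` and `TABLE_TAG` are made up for illustration. The regex that locates tags assumes no attribute value contains a literal `>` and that no table tags appear inside comments or scripts, so treat it as a sketch and not a robust HTML parser.

```python
import re

# Matches an opening <table ...> tag or a closing </table> tag, ignoring case.
TABLE_TAG = re.compile(r'<table\b[^>]*>|</table\s*>', re.IGNORECASE)

def first_table(html):
    """Return the first <table>...</table> block, nested tables included, or None."""
    stack = []  # start positions of the <table> tags that are still open
    for m in TABLE_TAG.finditer(html):
        if not m.group().startswith('</'):
            stack.append(m.start())          # opening tag: push its position
        elif stack:
            start = stack.pop()              # closing tag: pop its partner
            if not stack:
                # The stack is empty again, so this tag closes the first table.
                return html[start:m.end()]
    return None  # no complete table found
```

Reading the file in one go, for example `first_table(open('page.html').read())`, avoids the problem of a table spanning several lines or the whole document sitting on a single line.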

salmanwahed · Answer 3 · 2014-04-10T09:35:33.753


You can use a regular expression for this.

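import re

# DOTALL lets '.' match newlines, so the table may span several lines.
# The lazy (.*?) stops at the first </table>, so a nested table is cut short.
tbl_pat = re.compile(r'<table(.*?)>(.*?)</table>', re.DOTALL | re.IGNORECASE)

with open('page.html', 'r') as f:
    html = f.read()  # read the whole file instead of going line by line

# search() scans the whole text; match() would only test the very beginning.
m = tbl_pat.search(html)
if m:
    print(m.group(2))  # contents of the first table, without the outer tags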

edited Apr 10 '14 at 09:35

answered Apr 10 '14 at 09:25

salmanwahed


It won't capture tags like ``.
– Sufian Latif Apr 10 '14 at 09:29
Well, I did not get it the first time. Thank you for pointing it out. – salmanwahed Apr 10 '14 at 09:36
