I am trying to find the first table in an HTML file and copy the whole table into a string s.

f = open('page.html' , 'r')
s = ""
for line in f.readlines():
  line = line.strip()
  if line.find('<table'):
    s += line
  if line.find('</table>'):
    break
print s

This code is not working. How do I solve this using only the Python standard library?

VeilEclipse

3 Answers

Try using XPath; see this SO question: Parse HTML via XPath
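If third-party parsers are not an option, the standard library's xml.etree.ElementTree supports a limited XPath subset. The sketch below is written for Python 3. It assumes the page is well-formed XHTML with no XML namespace and no HTML-only entities such as &nbsp;, because ordinary "tag soup" HTML will raise a parse error. The function name first_table_xpath is made up for illustration.

```python
import xml.etree.ElementTree as ET

def first_table_xpath(xhtml):
    """Return the first <table> element of a well-formed XHTML string, or None."""
    # Assumption: the input parses as XML; otherwise ET.ParseError is raised.
    root = ET.fromstring(xhtml)
    # './/table' selects <table> descendants; find() returns the first one
    # in document order, or None when there is no match.
    table = root.find('.//table')
    if table is None:
        return None
    # Drop any text that follows </table> so it is not serialized with it.
    table.tail = None
    return ET.tostring(table, encoding='unicode')
```

For the file in the question this would be called as first_table_xpath(open('page.html').read()), provided the file meets the assumptions above.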

mpcabd

If you have to stick to the standard library, what you need is everything between the first <table> and its matching </table>. The matching tag is not necessarily the first </table> in the file, because tables can be nested.

To find it, you'll need a stack. Read the file from the beginning. Whenever you encounter a <table>, push its position onto the stack, and whenever you see a </table>, pop one entry from the stack. This pairs each </table> with its corresponding <table>.

Watch for the </table> whose pop leaves the stack empty: it must close the first <table>, so store its position and stop.

Now you have the positions of the first <table> and its matching </table>, so you can copy everything between them to a string.
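The steps above can be sketched with nothing but the re module. This is a minimal illustration, not a full HTML parser: it assumes table tags never appear inside comments, scripts, or attribute values, and the names TABLE_TAG and first_table are made up for the example.

```python
import re

# Matches an opening "<table ...>" tag or a closing "</table>" tag.
# Group 1 is "/" for a closing tag and "" for an opening tag.
TABLE_TAG = re.compile(r'<(/?)table\b[^>]*>', re.IGNORECASE)

def first_table(html):
    """Return the first <table>...</table> block, nested tables included, or None."""
    stack = []  # start offsets of the tables that are currently open
    for tag in TABLE_TAG.finditer(html):
        if tag.group(1) == '':
            # Opening tag: remember where this table starts.
            stack.append(tag.start())
        elif stack:
            # Closing tag: pair it with the most recently opened table.
            start = stack.pop()
            if not stack:
                # The stack is empty again, so the first table just closed.
                return html[start:tag.end()]
    return None  # no table found, or the first table is never closed
```

For the file in the question, s = first_table(open('page.html').read()) would hold the whole first table, including any tables nested inside it.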

Sufian Latif

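You can use a regular expression for this, as long as the first table contains no nested tables.

import re

# Non-greedy (.*?) stops at the first </table>; DOTALL lets '.' match newlines.
tbl_pat = re.compile(r'<table\b[^>]*>(.*?)</table>', re.DOTALL | re.IGNORECASE)
with open('page.html', 'r') as f:
    html = f.read()  # read the whole file, since a table usually spans many lines
# search() scans the whole string; match() would only test its very beginning.
m = tbl_pat.search(html)
if m:
    # group(1) is the table's contents; use group(0) to keep the <table> tags too.
    print(m.group(1))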
salmanwahed