-1

I tried the code below to find the underlined text in an HTML file, but it does not work.

f=open("jk.html","r")
while True:
    for line in f.read():
        for i in line.split():
            j=i.find("<ul>")
            k=i.find("</ul>")
            for m in range(j, k):
                print(m)

f.close()

Here is my HTML file:

<html>
<body>
   <ul> hill </ul>
   <p> millfhhf </p>
</body>
</html>
Steinar Lima

2 Answers

1

This becomes really simple if you use the BeautifulSoup module, which is far better at parsing HTML (especially messy HTML) than searching the text by hand.

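import bs4

# Parse with the standard-library parser; the file is closed when the block ends.
with open("test.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# Print the text inside every <u> element.
for underlined in soup.find_all("u"):
    print(underlined.get_text())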

Also, the tag for underlined text in HTML is <u>, not <ul> (which marks an unordered list), so the file should look like this:

<html>
<body>
   <p>
       <u> hill </u>
       <u> millfhhf </u>
   </p>
</body>
</html>
mdadm
  • Yes, you need to install it. It isn't included with Python by default. What operating system are you using? – mdadm Mar 22 '14 at 06:46
  • windows 7 operating system – Sumeet ten Doeschate Mar 22 '14 at 06:48
  • You'll want to install it either using pip or easy_install (obtained with the python setuptools). See this stackoverflow [question](http://stackoverflow.com/questions/12228102/how-to-install-beautiful-soup-4-with-python-2-7-on-windows) for instructions. – mdadm Mar 22 '14 at 06:52
  • Did that help, were you able to get bs4 installed? – mdadm Mar 24 '14 at 13:56
0

This code does not work because read() returns the whole remaining file as a single string, so the for loop iterates over it character by character. To process lines, use readline() or just iterate over the file object:

with open("jk.html") as fp:
    for line in fp:
        pass  # do whatever with each line
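
To see the difference concretely, here is a small sketch that compares the two loops. It uses io.StringIO as an in-memory stand-in for the file, and the SAMPLE string and both helper names are made up for illustration:

```python
import io

# Hypothetical stand-in for the contents of the HTML file.
SAMPLE = "<html>\n<body>\n   <u> hill </u>\n</body>\n</html>\n"


def iterate_read(fp):
    """Iterate over fp.read(): a string yields one character at a time."""
    return [chunk for chunk in fp.read()]


def iterate_lines(fp):
    """Iterate over the file object itself: yields one line at a time."""
    return [line.rstrip("\n") for line in fp]


chars = iterate_read(io.StringIO(SAMPLE))
lines = iterate_lines(io.StringIO(SAMPLE))

print(chars[:6])   # single characters
print(lines[2])    # a whole line
```

With single characters, a call like i.find("<ul>") can never match, which is why the original loop prints nothing useful.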

That said, use the built-in HTMLParser (html.parser in Python 3), BeautifulSoup, or an XML parser for any reliable parsing.

Also, the tag for the underlining is <u>, not <ul>.
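
If installing a third-party package is not an option, something along these lines should work with only the standard library. This is a minimal sketch, assuming Python 3 and that the file uses <u> tags; the class name UnderlineExtractor and the helper underlined_text are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser


class UnderlineExtractor(HTMLParser):
    """Collects the text found inside <u>...</u> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of <u> tags currently open
        self.results = []    # text of each completed outermost <u> element
        self._buffer = []    # pieces of text seen inside the current <u>

    def handle_starttag(self, tag, attrs):
        if tag == "u":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "u" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears while a <u> element is open.
        if self.depth > 0:
            self._buffer.append(data)


def underlined_text(html):
    """Return the stripped text of every <u> element in the given HTML string."""
    parser = UnderlineExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample input using <u> instead of <ul>.
SAMPLE = """<html>
<body>
   <p>
       <u> hill </u>
       <u> millfhhf </u>
   </p>
</body>
</html>"""

print(underlined_text(SAMPLE))
```

To run it on a real file, read the file into a string first and pass that string to underlined_text.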

bereal