I am using Python 2.7.8. I have a web page that contains text in an ordered list (`<ol>`). I want to extract the text of the list items, i.e.:

Coffee
Tea
Milk

My HTML code:

<!DOCTYPE html>
<html>
<body>

<ol type="I">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
<ol type="a">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>

<ol type="1">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>


</body>
</html>

The code I keep trying does not work; every run ends in an error.

Python code:

import urllib2
from urllib2 import Request
import re
from bs4 import BeautifulSoup

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
c=0;
soup = BeautifulSoup(htmls, 'lxml')
#skipp portion of code
res2 = soup.find('h1',attrs={"class":"entry-title"})
br = soup.find('span',attrs={'class':'IL_ADS'})
br = soup.find('p').text # separate title

for question in soup.find_all(text=re.compile(r"^\d+\.")):
    answers = [br.next_sibling.strip() for br in question.find_next_siblings("br")]
    #s = ''.join([i for i in question if not i.isdigit()])
    if not answers:
        break

    print question.encode('utf-8')
    ul = question.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))

When I run this code, I get the following error:

Traceback (most recent call last):
  File "C:\Users\DELL\Desktop\python\s\fyp\crawldataextraction.py", line 47, in <module>
    print(ul.get_text(' ', strip=True))
AttributeError: 'NoneType' object has no attribute 'get_text'
user3440716
  • Possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) – Psytho Dec 21 '15 at 11:32
  • The tag you're searching for in your code is `ul`, but I can only see `ol` tags in your HTML file. Isn't this a typo?  – Remi Guan Dec 21 '15 at 11:38
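
Following up on the `ul` vs `ol` point: a minimal sketch of how the list items could be pulled out of the sample HTML from the question, assuming the real page uses the same `<ol>`/`<li>` structure (which has not been verified here). The helper name `extract_ordered_lists` is hypothetical, and the built-in `html.parser` is used in place of `lxml` only to avoid an extra dependency. The last lines show why the traceback occurs: `find()` and `find_next_sibling()` return `None` when nothing matches, so the result should be checked before calling `get_text()`.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would be fetched first.
html = """
<ol type="I">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
<ol type="a">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
<ol type="1">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
"""


def extract_ordered_lists(markup):
    """Return one list of item texts per <ol> element (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    result = []
    for ol in soup.find_all("ol"):
        # Collect the stripped text of every <li> inside this <ol>.
        items = [li.get_text(" ", strip=True) for li in ol.find_all("li")]
        result.append(items)
    return result


lists = extract_ordered_lists(html)
for items in lists:
    print(", ".join(items))

# The sample contains no <ul>, so find("ul") returns None.
# Guarding against None avoids the AttributeError from the traceback.
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")
ul_text = ul.get_text(" ", strip=True) if ul is not None else None
```

If the real page wraps its answers in some other element, the tag name passed to `find_all` would need to change accordingly; the `None` check applies either way.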