How to remove number, dot and letter bullets from a list while extracting text from a website

Question

I am using Python 2.7.8. I have a website that contains text written as a bulleted list, specifically an ordered list (<ol>). I want to extract only the text, i.e.:

Coffee
Tea
Milk

My html code:

<!DOCTYPE html>
<html>
<body>

<ol type="I">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
<ol type="a">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>

<ol type="1">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>


</body>
</html>

The code I have been trying does not work; I get an error every time.

Python code:

import urllib2
from urllib2 import Request
import re
from bs4 import BeautifulSoup

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
c=0;
soup = BeautifulSoup(htmls, 'lxml')
#skipp portion of code
res2 = soup.find('h1',attrs={"class":"entry-title"})
br = soup.find('span',attrs={'class':'IL_ADS'})
br = soup.find('p').text # separate title

for question in soup.find_all(text=re.compile(r"^\d+\.")):
    answers = [br.next_sibling.strip() for br in question.find_next_siblings("br")]
    #s = ''.join([i for i in question if not i.isdigit()])
    if not answers:
        break

    print question.encode('utf-8')
    ul = question.find_next_sibling("ul")
    print(ul.get_text(' ', strip=True))

But when I run this code I also get an error:

Traceback (most recent call last):
  File "C:\Users\DELL\Desktop\python\s\fyp\crawldataextraction.py", line 47, in <module>
    print(ul.get_text(' ', strip=True))
AttributeError: 'NoneType' object has no attribute 'get_text'

Possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) — Psytho, Dec 21 '15 at 11:32
The tag you're searching for in your code is `ul`, but I can only see `ol` tags in your HTML file. Isn't this a typo? — Remi Guan, Dec 21 '15 at 11:38

Caius · Accepted Answer · 2015-12-22T00:21:21.453

You can see why the BeautifulSoup method does not work on your variable 'ul' by inserting this line and commenting out the line you previously had.

print ul
"""print(ul.get_text(' ', strip=True))"""

What is happening is that your variable ul holds None rather than a tag, so each question is printed followed by None:

C99 standard guarantees uniqueness of ____ characters for internal names. None
C99 standard guarantess uniqueness of _____ characters for external names. None
Which of the following is not a valid variable name declaration? None
Which of the following is not a valid variable name declaration? None
Variable names beginning with underscore is not encouraged. Why? None
All keywords in C are in None
Variable name resolving (number of significant characters for uniqueness of variable) depends on None
Which of the following is not a valid C variable name? None
Which of the following is true for variable names in C? None

Since there is no ul tag after the question for BeautifulSoup to find, find_next_sibling("ul") returns None, and calling get_text on None raises the AttributeError. So in this case, the way I would go about stripping the numbering would be to use regex.

Removing the numbers and dots:

import urllib2
from urllib2 import Request
import re
from bs4 import BeautifulSoup

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
c = 0
soup = BeautifulSoup(htmls, 'lxml')
# skipp portion of code
res2 = soup.find('h1', attrs={"class": "entry-title"})
br = soup.find('span', attrs={'class': 'IL_ADS'})
br = soup.find('p').text  # separate title

for question in soup.find_all(text=re.compile(r"^\d+\.")):
    answers = [br.next_sibling.strip() for br in question.find_next_siblings("br")]
    # s = ''.join([i for i in question if not i.isdigit()])
    if not answers:
        break
    ul = question.encode('utf-8')
    # match the leading number and dot, e.g. "1." or "10."
    ol = re.compile(r'^\d+\.')
    ol = ol.sub(' ', str(ul))
    print ol
    # print(ul.get_text(' ', strip=True))

Output:

  C99 standard guarantees uniqueness of ____ characters for internal names.
  C99 standard guarantess uniqueness of _____ characters for external names.
  Which of the following is not a valid variable name declaration?
  Which of the following is not a valid variable name declaration?
  Variable names beginning with underscore is not encouraged. Why?
  All keywords in C are in
  Variable name resolving (number of significant characters for uniqueness of variable) depends on
  Which of the following is not a valid C variable name?
  Which of the following is true for variable names in C?

I used regex to compile the pattern of a number followed by a dot, then used the sub() method of the compiled pattern to replace it with a space.
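
The same idea can be extended to the letter and Roman-numeral markers mentioned in the question. The following is only a minimal sketch, not the code from the answer above: `MARKER` and `strip_marker` are hypothetical names, and it assumes the marker sits at the start of the string, ends in `.` or `)`, and is followed by whitespace. A sentence that starts with a short abbreviation such as "Dim. " would be stripped by mistake, so the pattern may need tightening for real pages.

```python
import re

# Leading list marker: digits ("12."), Roman numerals ("IV."),
# or a single letter ("a)"), followed by "." or ")" and whitespace.
MARKER = re.compile(r'^\s*(?:\d+|[ivxlcdm]+|[a-z])[.)]\s+', re.IGNORECASE)


def strip_marker(text):
    """Return text without a leading list marker, if one is present."""
    return MARKER.sub('', text, count=1)


for line in ["1. Coffee", "12. Tea", "a) Milk", "IV. Coffee", "Coffee"]:
    print(strip_marker(line))
```

Because a single pattern covers all marker styles, there is no need to chain one str.replace() call per number or letter.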

Thanks, Caius, for the response. But how shall I remove those numbers? I thought it was ul or ol? Your solution is printing None... — user3440716, Dec 21 '15 at 12:13
For example, if I want to remove 1. ________ 2. ________ I can also use str.replace("1.", ""), but that becomes lengthy because I would have to write it an unbounded number of times. Likewise for a) ________ b) ________ with str.replace("a)", ""). — user3440716, Dec 21 '15 at 12:24
For the first question, in your case it doesn't seem to matter whether you use ul or ol, as BeautifulSoup picks up the same string (the questions) either way. Though you should note that ol stands for ordered list (applicable in your case) while ul stands for unordered list. — Caius, Dec 21 '15 at 23:37
For the second, I would recommend looking through the regex documentation and especially re.sub() — Caius, Dec 21 '15 at 23:40
I've updated my answer to show regex in action where it prints out the questions without the number followed by a dot. — Caius, Dec 22 '15 at 00:15

score 0 · Answer 2 · answered Dec 21 '15 at 11:34

I have never used BeautifulSoup, but I would do this with a regular expression:

import re

html = """<!DOCTYPE html>
<html>
<body>

<ol type="I">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
<ol type="a">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>

<ol type="1">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>


</body>
</html>"""

# capture everything between <li> and </li>, including multi-word items
regexp = re.compile(r'<li>(.*?)</li>')
result = regexp.findall(html)

for i in result:
    print(i)
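
As an alternative to matching tags with a regex, the items can be collected with an HTML parser. The I., a. and 1. markers of an `<ol>` are generated by the browser from the `type` attribute and are not part of the `<li>` text, so nothing has to be stripped for this markup. The sketch below uses the standard-library parser and is written for Python 3 (on Python 2 the module is named `HTMLParser`); `ListItemCollector` and `extract_list_items` are hypothetical names, and nested lists are not handled.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element (nested lists not handled)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.items = []
        self._in_li = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside an <li>
        if self._in_li:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            self.items.append("".join(self._chunks).strip())
            self._in_li = False


def extract_list_items(html):
    """Return the text of each <li> in html as a list of strings."""
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    return collector.items


sample = '<ol type="I"><li>Coffee</li><li>Tea</li><li>Milk</li></ol>'
for item in extract_list_items(sample):
    print(item)
```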

answered Dec 21 '15 at 11:34

macabeus


My regex compile(r"^\d+\.") is also necessary for me. And if I use compile(r"^\d+<\/li>.") I get nothing. — user3440716, Dec 21 '15 at 11:38
You need to use a group to capture things. Have you tried `(^\d+)<\/li>`? You could use this site to test your regexp: https://regex101.com/ — macabeus, Dec 21 '15 at 11:40
