
I'm trying to get the `<li>` elements of an HTML page using Python's BeautifulSoup library.

The HTML I'm trying to parse is this page:

https://ccnav6.com/ccna-4-chapter-1-exam-answers-2017-v5-0-3-v6-0-full-100.html

It contains a list of questions and answers and I'm trying to parse those.

My problem is that no matter how I parse the HTML, I only get the first `<li>`.

My code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

url = 'https://ccnav6.com/ccna-4-chapter-1-exam-answers-2017-v5-0-3-v6-0-full-100.html'
uClient = uReq(url)
# getting html from connection
page_html = uClient.read()
# close connection
uClient.close()
# use beautifulSoup to parse html
page_soup = soup(page_html, "html.parser")
# get main content of page
contentBlock = page_soup.find("div",{"class":"post-single-content box mark-links entry-content"})
# get all questions and answers
questions = contentBlock.div.ol.li.ol.find_all("li")
# for some reason I'm only getting the first question
Keyur Potdar
Time4Boom
  • Change the parser from `html.parser` to `lxml` and it'll work. Not exactly sure why though, maybe the HTML is broken. You'll need to download that parser first. `pip install lxml`. – Keyur Potdar Mar 28 '18 at 05:21
  • @KeyurPotdar Wow, thank you so much. Really weird behaviour... I'm new to web scraping and sat here for a few hours not understanding why it only outputs the first element. – Time4Boom Mar 28 '18 at 05:25
  • The HTML there is broken, as it contains closing tags without matching opening tags. Try one of the other parsers, such as `lxml` or `html5lib`. – Martijn Pieters Mar 28 '18 at 09:10
  • Both `lxml` and `html5lib` produce 27 `li` elements; `html.parser` really doesn't like those stray closing tags. – Martijn Pieters Mar 28 '18 at 09:15
  • @MartijnPieters oh, I didn't even notice that. Thanks for the help. – Time4Boom Mar 28 '18 at 10:32
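
To make the comments above concrete, here is a rough sketch of two checks, not a definitive fix. The first part uses only the standard library to list closing tags that have no open tag to match, which is the kind of breakage described for this page. The second part counts the `<li>` elements inside the first `<ol>` with each parser that is installed, so the results can be compared side by side. The names `StrayCloseTagFinder`, `find_stray_closing_tags`, `count_list_items` and `SAMPLE_HTML` are made up for illustration, and the sample markup is a toy snippet, not the real page. The stray-tag check is deliberately naive (it does not model HTML's implied end tags), and how `html.parser` copes with stray tags may differ between BeautifulSoup versions, so treat the counts as something to inspect rather than a guaranteed outcome.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StrayCloseTagFinder(HTMLParser):
    """Hypothetical helper: record closing tags that match no open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # names of the elements currently open
        self.stray = []      # (tag, line, column) for each unmatched closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens and closes nothing here.
        pass

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray.append((tag, *self.getpos()))


def find_stray_closing_tags(html):
    """Return a list of (tag, line, column) for unmatched closing tags."""
    finder = StrayCloseTagFinder()
    finder.feed(html)
    finder.close()
    return finder.stray


def count_list_items(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser: number of <li> inside the first <ol>} for installed parsers."""
    # Imported here so the stray-tag check above works without bs4 installed.
    from bs4 import BeautifulSoup, FeatureNotFound

    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # parser not installed; try: pip install lxml html5lib
        first_ol = soup.find("ol")
        counts[name] = len(first_ol.find_all("li")) if first_ol else 0
    return counts


# Toy markup (an assumption, not the real page): note the stray </p>.
SAMPLE_HTML = "<ol><li>Q1</li></p><li>Q2</li><li>Q3</li></ol>"

for tag, line, column in find_stray_closing_tags(SAMPLE_HTML):
    print(f"stray </{tag}> at line {line}, column {column}")
```

With `bs4` installed, calling `count_list_items(page_html)` on the downloaded page should show whether the parsers disagree. If `html.parser` reports fewer items than `lxml` or `html5lib`, switching the parser argument in the original `soup(page_html, "html.parser")` call is the change suggested in the comments.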

0 Answers