Can't scrape web page with BeautifulSoup or lxml

Question

I am very new to programming, so this may be a silly question. I wanted to learn to scrape web pages, so I learned BeautifulSoup. It worked for a few sites, but I got stuck on the following page:

from bs4 import BeautifulSoup
import requests

r  = requests.get("http://www.dlb.today/result/en")
data = r.text
soup = BeautifulSoup(data, "lxml")

data = soup.find_all("tbody", {"id": "pageData1"})
data2 = soup.find_all("ul", {"class": "res_allnumber"})
print(data)
print(data2)
# no point going further if I can't get the raw data, I think

This worked fine on a similar site I scraped:

r2  = requests.get("http://www.nlb.lk/results-more.php?id=1")
data2 = r2.text
soup2 = BeautifulSoup(data2, "lxml")
news2 = soup2.find_all("a", {"class": "lottery-numbers"})
# print(news2)  # raw HTML, for checking
for draw_number in news2:
    print(draw_number.contents[0])

I couldn't scrape the table I wanted, so I tried lxml instead. Still no luck.

#lxml
import requests

r  = requests.get("http://www.dlb.today/result/en")
data = r.text

#print data

import lxml.html as LH

content = data
root = LH.fromstring(content)
for tag1 in root.xpath('//tbody[@class="pageData1"]//li'):
    print(tag1.text_content())

I don't know where my error is or what to do next. If anyone can point me in the right direction, I would appreciate it!

Is the data being loaded with JavaScript? Try using `curl` and see whether the page has what you are looking for. If it doesn't, the data is probably being loaded through JavaScript, in which case look into using headless Chrome. — jmunsch, Aug 07 '17 at 18:29

score 1 · Answer 1 · answered Aug 07 '17 at 19:57

I tried replicating your use case. It seems the data has not yet been loaded into the page when the Python code makes its request. As a result, the "tbody" and its contents are empty.

I confirmed this by saving the downloaded HTML to a file:

with open('sample.html', 'w') as fh:
    fh.write(data)
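
The same check can be scripted instead of opening the saved file by hand. This is a minimal sketch, assuming the class name `res_allnumber` from the question marks the rows of interest. `has_static_results` is a hypothetical helper name, and the two HTML strings are made-up stand-ins for a server response:

```python
def has_static_results(html, marker="res_allnumber"):
    """Return True if the marker text appears in the raw, unrendered HTML."""
    return marker in html


# Made-up stand-ins: an empty shell versus a page whose rows are already present.
empty_shell = '<table><tbody id="pageData1"></tbody></table>'
filled_page = '<ul class="res_allnumber"><li class="res_number">04</li></ul>'

print(has_static_results(empty_shell))  # False
print(has_static_results(filled_page))  # True
```

With the question's code this would be called as `has_static_results(r.text)`. A `False` result suggests the rows are added later by JavaScript, although it could also mean the marker is simply wrong.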

There are a couple of solutions mentioned on the web to resolve this issue:

Using the Python library dryscrape. The details are described in "Web-scraping JavaScript page with Python".
Using Selenium:

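from selenium import webdriver
import time

# executable_path points at the downloaded geckodriver binary
driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get("http://www.dlb.today/result/en")
time.sleep(5)  # give the page's JavaScript time to load the results
htmlSource = driver.page_source
driver.quit()  # close the browser once the source is captured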

Download geckodriver first. You can then use htmlSource as the input to BeautifulSoup.
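
A fixed `time.sleep(5)` either wastes time or is too short on a slow connection. One possible alternative is to poll until the content shows up. The sketch below is written for Python 3 and uses only the standard library. `wait_until` is a hypothetical helper, not part of Selenium, and the choice of `res_allnumber` as the thing to wait for is an assumption taken from the question. Selenium's own `WebDriverWait` serves the same purpose.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout seconds pass.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in place of time.sleep(5), with the driver from above:
#   loaded = wait_until(lambda: "res_allnumber" in driver.page_source)
#   htmlSource = driver.page_source if loaded else None
```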

score 1 · Accepted Answer · answered Aug 07 '17 at 21:06

JavaScript is involved in loading the data that this page displays. Fortunately, the JavaScript loads another HTML page from this URL:

http://www.dlb.today/result/pagination_re

You can access this URL with a POST request directly like this:

import requests
from bs4 import BeautifulSoup

url = "http://www.dlb.today/result/pagination_re"
data = {"pageId": "0", "resultID": "1001", "lotteryID": "1", "lastsegment": "en"}
page = requests.post(url, data=data)
soup = BeautifulSoup(page.content, 'html.parser')
# use a separate loop name so the "data" payload above is not overwritten
for row in soup.find_all("ul", {"class": "res_allnumber"}):
    print(row)

You may have to experiment with the "data" values to get exactly what you want!
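
One way to experiment is to generate the payloads in a loop. This sketch assumes that `pageId` selects a page of results, which is a guess and not something the site documents. `build_payload` and `fetch_page` are hypothetical helper names, and the defaults are the values used above.

```python
URL = "http://www.dlb.today/result/pagination_re"


def build_payload(page_id, result_id="1001", lottery_id="1", lastsegment="en"):
    """Build the form fields for one POST, using the answer's values as defaults."""
    return {
        "pageId": str(page_id),
        "resultID": str(result_id),
        "lotteryID": str(lottery_id),
        "lastsegment": lastsegment,
    }


def fetch_page(page_id):
    """POST one payload and return the response body (needs network access)."""
    import requests  # imported here so build_payload works without requests

    response = requests.post(URL, data=build_payload(page_id), timeout=10)
    response.raise_for_status()
    return response.text


# Payloads for the first three pageId values; whether pageId really selects
# a results page is an assumption that has to be checked against the site.
payloads = [build_payload(page_id) for page_id in range(3)]
for payload in payloads:
    print(payload)
```

Calling `fetch_page(0)` should then request the same data as the code above, and the returned text can be passed to BeautifulSoup as before.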

The output is:

<ul class="res_allnumber"><li class="res_number">04</li><li class="res_number">30</li><li class="res_number">44</li><li class="res_number">56</li><li class="res_number" style="background-color: #971B7E; color: #fff;">29</li><li class="res_eng_letter">V</li></ul>
<ul class="res_allnumber"><li class="res_number">15</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">47</li><li class="res_number" style="background-color: #016B21; color: #fff;">69</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">09</li><li class="res_number">13</li><li class="res_number">17</li><li class="res_number">48</li><li class="res_number" style="background-color: #267FFF; color: #fff;">73</li><li class="res_eng_letter">D</li></ul>
<ul class="res_allnumber"><li class="res_number">31</li><li class="res_number">41</li><li class="res_number">43</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">52</li><li class="res_eng_letter">U</li></ul>
<ul class="res_allnumber"><li class="res_number">03</li><li class="res_number">09</li><li class="res_number">19</li><li class="res_number">73</li><li class="res_number" style="background-color: #016B21; color: #fff;">67</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">17</li><li class="res_number">22</li><li class="res_number">35</li><li class="res_number">39</li><li class="res_number" style="background-color: #267FFF; color: #fff;">59</li><li class="res_eng_letter">Z</li></ul>
<ul class="res_allnumber"><li class="res_number">08</li><li class="res_number">15</li><li class="res_number">30</li><li class="res_number">55</li><li class="res_number" style="background-color: #971B7E; color: #fff;">71</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">11</li><li class="res_number">16</li><li class="res_number">50</li><li class="res_number">57</li><li class="res_number" style="background-color: #016B21; color: #fff;">75</li><li class="res_eng_letter">Q</li></ul>
<ul class="res_allnumber"><li class="res_number">27</li><li class="res_number">30</li><li class="res_number">43</li><li class="res_number">71</li><li class="res_number" style="background-color: #267FFF; color: #fff;">63</li><li class="res_eng_letter">E</li></ul>
<ul class="res_allnumber"><li class="res_number">19</li><li class="res_number">20</li><li class="res_number">31</li><li class="res_number">43</li><li class="res_number" style="background-color: #971B7E; color: #fff;">61</li><li class="res_eng_letter">I</li></ul>
<ul class="res_allnumber"><li class="res_number">24</li><li class="res_number">41</li><li class="res_number">47</li><li class="res_number">72</li><li class="res_number" style="background-color: #016B21; color: #fff;">32</li><li class="res_eng_letter">K</li></ul>
<ul class="res_allnumber"><li class="res_number">13</li><li class="res_number">51</li><li class="res_number">61</li><li class="res_number">65</li><li class="res_number" style="background-color: #267FFF; color: #fff;">48</li><li class="res_eng_letter">E</li></ul>
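
To get plain values instead of tags, the text of each `li` has to be pulled out. With BeautifulSoup, calling `get_text()` on each `li` would do this. The sketch below shows the same idea using only the standard library's `html.parser`, so it runs without extra installs. `DrawParser` and `parse_draws` are hypothetical names, and the sample string is the first row of the output above.

```python
from html.parser import HTMLParser


class DrawParser(HTMLParser):
    """Collect the text of every <li> inside each <ul class="res_allnumber">."""

    def __init__(self):
        super().__init__()
        self.draws = []        # one list of strings per matching <ul>
        self._in_list = False  # currently inside a res_allnumber <ul>?
        self._in_item = False  # currently inside an <li> of that <ul>?

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and "res_allnumber" in classes:
            self._in_list = True
            self.draws.append([])
        elif tag == "li" and self._in_list:
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False
        elif tag == "ul":
            self._in_list = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.draws[-1].append(data.strip())


def parse_draws(html):
    """Return a list of draws, each a list of the <li> texts in document order."""
    parser = DrawParser()
    parser.feed(html)
    parser.close()
    return parser.draws


# First row of the output shown above.
sample = (
    '<ul class="res_allnumber"><li class="res_number">04</li>'
    '<li class="res_number">30</li><li class="res_number">44</li>'
    '<li class="res_number">56</li>'
    '<li class="res_number" style="background-color: #971B7E; color: #fff;">29</li>'
    '<li class="res_eng_letter">V</li></ul>'
)
print(parse_draws(sample))  # [['04', '30', '44', '56', '29', 'V']]
```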
