
scrape.py

# code to scrape the links from the html

from bs4 import BeautifulSoup

# read the saved HTML page
with open('scrapeFile', 'r') as data:
    html = data.read()

soup = BeautifulSoup(html, features="html.parser")

# extract the link from each listing block
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    links.append('https://godamwale.com' + str(div.a.get('href')))

print(links)

# write the collected links to a file, one per line
with open("links.txt", "w") as file:
    for link in links:
        file.write(link + '\n')
        print(link)

I have successfully got the list of links using this code. But when I try to scrape the data from those links, their HTML page source does not contain the data, which makes extraction difficult. I have tried the Selenium driver, but it doesn't work well for me. I want to scrape the data from the link below, which has the data in HTML sections for customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract this data along with the name, location, contact number and type.

https://godamwale.com/list/result/591359c0d6b269eecc1d8933

That is the link. If someone finds a solution, please share it with me.

kd007
  • Has someone done this before? – kd007 Jan 05 '19 at 11:10
  • "which don't have any of the source code": I didn't get that. What do you mean? Please explain in detail – Dev Jan 05 '19 at 11:14
  • When I use Ctrl+U to view the source code, it only shows code that doesn't have the data in it, but I want to scrape that data, and I do find it when I inspect the page. – kd007 Jan 05 '19 at 11:16
  • You said you got the links, but you haven't mentioned what you want to do next – Dev Jan 05 '19 at 11:17
  • I want to scrape data from those links, one by one, and then put it in an Excel file – kd007 Jan 05 '19 at 11:17
  • Yes, you want data, but there is a lot more data on the page. Do you want to scrape all the data under the headings construction details, licence and automation, and commercial details, or are you looking for something else? Please mention which data or which section of the page you want to scrape – Dev Jan 05 '19 at 11:23
  • Actually, I want to scrape the whole data from the page. – kd007 Jan 05 '19 at 11:23
  • Each and every piece of data under the headings I have mentioned: construction details, licence and automation, commercial details. – kd007 Jan 05 '19 at 11:24
  • Please add it to the question as well; it is crucial to know what you intend to happen – Dev Jan 05 '19 at 11:28
  • Yes, definitely; I have edited the question. Are you able to find a solution for that problem? – kd007 Jan 05 '19 at 11:34
  • Yes, have a look at my answer – Dev Jan 05 '19 at 19:01
  • FYI it's "scrape" (and "scraping", "scraped", "scraper"), not "scrap" – DisappointedByUnaccountableMod Apr 23 '21 at 09:26

2 Answers

Using the developer tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a JSON response, probably containing the data you're looking for.

Python 2.x:

import urllib2, json
# urlopen().read() returns a str in Python 2, so json.loads can parse it directly
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents

Python 3.x:

import urllib.request, json
# urlopen().read() returns bytes in Python 3, so decode to str before parsing
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
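
If that request does return the data, a rough sketch of how you could then loop over the links collected in links.txt and write a few fields to a CSV file is shown below (the endpoint pattern is taken from the request above, but the JSON keys name, location and contact are only placeholders; swap them for the keys you actually see in the response):

import csv
import json
import urllib.request

# Assumption: each line in links.txt ends with the warehouse id, and the
# public endpoint above accepts that same id. The JSON keys below are guesses.
with open("links.txt") as f, open("warehouses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "name", "location", "contact"])
    for line in f:
        warehouse_id = line.strip().rsplit("/", 1)[-1]
        url = "https://godamwale.com/public/warehouse/" + warehouse_id
        data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
        writer.writerow([
            warehouse_id,
            data.get("name"),      # placeholder key
            data.get("location"),  # placeholder key
            data.get("contact"),   # placeholder key
        ])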
  • Traceback (most recent call last): File "scrapFile.py", line 48, in contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read()) File "/usr/lib/python3.5/json/__init__.py", line 312, in loads s.__class__.__name__)) TypeError: the JSON object must be str, not 'bytes' – kd007 Jan 05 '19 at 12:23
  • It is showing this error while I am running your code – kd007 Jan 05 '19 at 12:24
  • It works for me on Python 3.6.6 (which Python version are you using?); however, I've taken a guess as to why it might not be working for you and updated my answer. You might like to check the following for a more robust solution: https://stackoverflow.com/questions/32795460/loading-json-object-in-python-using-urllib-request-and-json-modules – Trevor Ian Peacock Jan 05 '19 at 12:33
  • @TrevorIanPeacock, can you quickly explain where you see/find that it makes that request and returns a json response? – chitown88 Jan 05 '19 at 12:44
    @chitown88 I haven't dug in to the javascript to see where/how that particular site makes that request, but by simple inspection of the requests made in the developer console I could see the request and response. See the below link about opening developer tools in your browser. If you wanted to determine exactly where/how the call is made, this is also where I would start. https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools – Trevor Ian Peacock Jan 05 '19 at 12:57

Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source. You have to wait until the page loads completely; notice the time.sleep(8) line in the code below:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"

wd = webdriver.Chrome(CHROMEDRIVER_PATH)

wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")

time.sleep(8)  # wait until the page loads completely

soup = BeautifulSoup(wd.page_source, 'lxml')

props_list = []
propvalues_list = []

# the seventh 'row' div holds the construction details section
div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)

    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)

Note: the code returns the construction details in two separate lists.
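
As a side note, instead of a fixed time.sleep(8) you could wait explicitly for the section you need to appear; below is a minimal sketch using Selenium's WebDriverWait (the info-col class name is taken from the code above, while the timeout and the chromedriver path are placeholder assumptions):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wd = webdriver.Chrome(r"C:\path\to\chromedriver.exe")  # placeholder path
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")

# wait up to 20 seconds for the detail columns to appear instead of sleeping blindly
WebDriverWait(wd, 20).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "info-col"))
)
html = wd.page_source  # then parse with BeautifulSoup as above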

Dev