I'm trying to pull data from a page using the code below. For some reason, the BeautifulSoup printout doesn't contain the data I see in the browser, and I'm wondering where I've gone wrong. I've been trying different kinds of headers, which is where I think the problem lies, but I may be wrong. For example, I can see the following element when I inspect the page in the browser, but I can't find it in the printout: <div class="textbold font-medium ng-binding">$25,000</div>

import urllib2
from bs4 import BeautifulSoup

url = 'https://www.prosper.com/listings#/detail/4964721'
hdr = {
    'Accept': 'text/html,application/xhtml+xml,*/*',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
}
req = urllib2.Request(url, headers=hdr)
html = urllib2.urlopen(req)
soup = BeautifulSoup(html, "lxml")
print soup
FancyDolphin
  • can you share what data you are seeing and also what you should be seeing? – Sanj Apr 02 '16 at 06:53
  • it's pretty big, but you can just see the page on the browser, and print out the page using the code I provided and you'll see they aren't the same thing. I've provided a small example, let me know if that's not enough. – FancyDolphin Apr 02 '16 at 06:54
  • Most of the page seems to be generated by JavaScript code, interpreted in the browser. But BeautifulSoup does not have a JavaScript engine. You could try to use Selenium, for example. See http://stackoverflow.com/questions/2148493/scrape-html-generated-by-javascript-with-python. – mzjn Apr 02 '16 at 07:11
  • @mzjn I'm trying to stay away from Selenium primarily because of it being slow even with a headless browser. But if it's the only way I'll reluctantly concede. – FancyDolphin Apr 02 '16 at 07:13
  • @mzjn To me, it seems like there's a refresh that happens, the first time you hit the website it doesn't have the data then the browser refreshes and gives the data, which may not be a javascript issue – FancyDolphin Apr 02 '16 at 07:20
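
One detail that may help explain the mismatch: the listing number in the question's URL sits after the `#`, and a URL fragment is handled by the browser rather than sent to the server as part of the HTTP request. The sketch below only illustrates that split, using the Python 3 standard library (the question itself uses Python 2's `urllib2`); it does not contact the site.

```python
from urllib.parse import urldefrag

url = "https://www.prosper.com/listings#/detail/4964721"

# urldefrag separates the part a server receives from the client-side fragment.
requested_url, fragment = urldefrag(url)

print(requested_url)  # https://www.prosper.com/listings
print(fragment)       # /detail/4964721
```

So the server presumably returns the same generic page for every listing, and the script running in the browser uses the fragment to decide which listing to load.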

1 Answer

The URL response has to be read like this:

html = urllib2.urlopen(req).read()

Based on your example, it appears you are looking for the rendered HTML, that is, the page as it looks after the browser has run its JavaScript.

In your case, an AJAX request is made on page load to

"https://www.prosper.com/listings/search?options=%7B%22listing_number%22:4964721,%22resp_fields%22:%22BROWSE_LISTING%22,%22orderservice_call%22:%22Y%22%7D"

The response to this AJAX request is JSON, which then gets rendered into the UI.
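
As a rough sketch of how that endpoint could be queried directly, without a browser: the code below rebuilds the request URL from the option values visible in the URL above and parses the response as JSON. It is written for Python 3, where `urllib.request` replaces `urllib2`. The function names `build_search_url` and `fetch_listing`, the header values, and the timeout are hypothetical choices for illustration. Whether the endpoint still answers a plain request like this, and what fields the JSON contains, are assumptions that need to be checked against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the AJAX request observed in the browser.
SEARCH_URL = "https://www.prosper.com/listings/search"


def build_search_url(listing_number):
    """Build the search URL with the same options as the observed request."""
    options = {
        "listing_number": listing_number,
        "resp_fields": "BROWSE_LISTING",
        "orderservice_call": "Y",
    }
    # The options are sent as one compact JSON string in the query string.
    # urlencode also percent-encodes ':' and ',', which decodes to the same value.
    query = urlencode({"options": json.dumps(options, separators=(",", ":"))})
    return SEARCH_URL + "?" + query


def fetch_listing(listing_number, timeout=10):
    """Request the listing JSON and return it as Python objects (hypothetical helper)."""
    headers = {"Accept": "application/json", "User-Agent": "Mozilla/5.0"}
    req = Request(build_search_url(listing_number), headers=headers)
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_listing(4964721)` would then return the decoded JSON, if the server accepts the request. Printing that result once is the simplest way to find where a value such as the $25,000 amount lives before writing any extraction code.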

Sanj
  • I don't think that's the problem or necessarily an issue, you still can't get the example I've shown. – FancyDolphin Apr 02 '16 at 07:09
  • an ajax request to "https://www.prosper.com/listings/search?options=%7B%22listing_number%22:4964721,%22resp_fields%22:%22BROWSE_LISTING%22,%22orderservice_call%22:%22Y%22%7D" gets executed on page load. – Sanj Apr 02 '16 at 07:25
  • a) Loved that reply, how did you find that ajax request? b) Do you think I can pretty much work with that and avoid Selenium? c) Thank you. – FancyDolphin Apr 02 '16 at 07:37
  • a) Firebug to the rescue. b) I think Selenium is the only option I know. c) You are welcome. – Sanj Apr 02 '16 at 07:38