
I want to scrape the data at this link:

http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json

I am not sure what type of content this link returns: is it HTML, JSON, or something else? Sorry for my limited web knowledge. I tried the following code to scrape it:

import requests

url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text

The type of source is unicode. I also tried scraping with urllib2:

import urllib2

source2 = urllib2.urlopen(url).read()

The type of source2 is str. I am not sure which method is better, because the link is not like a normal web page that contains HTML tags. If I want to clean the scraped data and turn it into a DataFrame (like a pandas DataFrame), what method or process should I follow?

Thanks.

Mr_Pi

2 Answers

0

The response is text with valid JSON inside it. You can validate it yourself with a service like http://jsonlint.com/ if you want. To do so, copy only the content inside the parentheses:

return_json("JSON code to copy")

To use that data, you just need to parse it in your program. Here is an example: https://docs.python.org/2/library/json.html
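As a rough sketch of that parsing step, assuming the response always has the shape name(...); with the JSON between the first "(" and the last ")". The function name jsonp_to_data and the sample string are made up for illustration; the sample is not the real payload.

```python
import json

def jsonp_to_data(text):
    """Parse the JSON payload out of a JSONP-style string such as return_json({...});"""
    start = text.index("(") + 1   # first character after the opening parenthesis
    end = text.rindex(")")        # position of the last closing parenthesis
    return json.loads(text[start:end])

# Hypothetical stand-in for the response body; the real one is much longer.
sample = 'return_json({"name": "example", "values": [1, 2, 3]});'
data = jsonp_to_data(sample)
print(data["values"])
```

Searching for the parentheses instead of hard-coding the callback name means the same function should still work if the callback parameter in the URL changes.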

narko
  • That's what I wrote. The content inside the parentheses is the JSON data that you need, and it is valid. I validated it using the service I pointed out. – narko Nov 10 '16 at 14:48
  • And I provided a code answer instead of a link. OP shouldn't need to copy that long response manually – OneCricketeer Nov 10 '16 at 14:51
  • I am not saying you need to copy the JSON response manually into your code. I was just trying to show that it is valid JSON. Just extract the JSON data from the response and do what you need in your code. If you need help handling JSON data in Python, I suggest you read the official docs: https://docs.python.org/2/library/json.html – narko Nov 10 '16 at 14:54
  • I don't need the link. I'm just saying your answer could be better (as in example code along with the link) – OneCricketeer Nov 10 '16 at 14:55
  • Thanks for the reply. I can now confirm that it is a JSON page. – Mr_Pi Nov 12 '16 at 16:26
0

The response is text. It does contain JSON; you just need to extract it:

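import json
import requests

# The body looks like return_json({...}); so slice off the wrapper.
strip_len = len("return_json(")

# url is the address from the question; [:-2] drops the trailing ");"
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)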
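For the DataFrame part of the question, one possible approach is to flatten the parsed result into a list of flat dicts, one per row. This is only a sketch: the nested structure and the key names below ("poll", "entries", "candidates", "name", "value") are assumptions made for illustration, so the real keys need to be checked against the actual data first.

```python
import json

# Hypothetical parsed result; NOT the real payload or its real key names.
parsed = json.loads("""
{"poll": {"entries": [
    {"date": "2016-01-20",
     "candidates": [{"name": "A", "value": "51.0"}, {"name": "B", "value": "44.5"}]},
    {"date": "2016-01-21",
     "candidates": [{"name": "A", "value": "50.5"}, {"name": "B", "value": "45.0"}]}
]}}
""")

rows = []
for entry in parsed["poll"]["entries"]:
    row = {"date": entry["date"]}                 # one output row per entry
    for cand in entry["candidates"]:
        row[cand["name"]] = float(cand["value"])  # one column per candidate
    rows.append(row)

print(rows[0])
```

A list of flat dicts like rows is a shape that pandas.DataFrame(rows) accepts, with the dict keys becoming the column names.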
OneCricketeer
  • Thanks for the reply. I tried this method before, but I didn't know I should strip the 'return_json('. One comment: the slicing line should be `source=requests.get(url).text[strip_len:-2]`, not -1. – Mr_Pi Nov 12 '16 at 16:27
  • I couldn't see the end of the response, but yes, you should strip that as it isn't part of the JSON – OneCricketeer Nov 12 '16 at 16:44
  • Basically, that URL is returning something that is meant to be queried by JavaScript, not Python. http://stackoverflow.com/a/7613857/2308683 – OneCricketeer Nov 12 '16 at 16:46