python scrape webpage and parse the content

Question

I want to scrape the data at this link:

http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json

I am not sure what type of content this link returns: is it HTML, JSON, or something else? Sorry for my limited web knowledge. I tried the following code to scrape it:

import requests

url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text

The type of source is unicode. I also tried urllib2:

import urllib2

source2=urllib2.urlopen(url).read()

The type of source2 is str. I am not sure which method is better, because the link is not like a normal web page that contains various tags. If I want to clean the scraped data and build a DataFrame (such as a pandas DataFrame), what method or process should I follow?

Thanks.

@depperm, thanks for reply. I update the link. It should work now. — Mr_Pi, Nov 10 '16 at 14:25

narko · Answer 1 · 2016-11-10T14:55:31.930

The response is text that wraps valid JSON in a function call. You can confirm this yourself with a service like http://jsonlint.com/ if you want. To do so, just copy the content inside the parentheses of

return_json("JSON code to copy")

To use the data, you just need to parse it in your program. The json module documentation has examples: https://docs.python.org/2/library/json.html
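As a rough illustration of that parsing step, here is a minimal sketch using only the standard library. The `sample_response` string and its `poll`, `id`, and `values` keys are made up for the example; the real payload is much longer and its structure is not shown in this thread.

```python
import json

# Hypothetical stand-in for the real response body (assumed shape only).
sample_response = 'return_json({"poll": {"id": 5491, "values": [1, 2, 3]}});'

# Keep only what sits between the first "(" and the last ")".
start = sample_response.find("(") + 1
end = sample_response.rfind(")")
payload = sample_response[start:end]

# The extracted text is plain JSON, so json.loads can parse it directly.
data = json.loads(payload)
print(data["poll"]["id"])  # 5491
```

Searching for the parentheses instead of counting characters means the slice does not depend on the callback name or on whether the response ends with a semicolon.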

edited Nov 10 '16 at 14:55

answered Nov 10 '16 at 14:33

narko


That's what I wrote. The content inside the brackets is the JSON data that you need. And it is valid. I validated it using the service I pointed out. – narko Nov 10 '16 at 14:48
And I provided a code answer instead of a link. OP shouldn't need to copy that long response manually – OneCricketeer Nov 10 '16 at 14:51
I am not saying you need to copy the JSON response manually in your code. I was just trying to show that it is valid JSON. Just extract the JSON data from the response and do what you need in your code. If you need help handling json data from python I suggest you read the official docs: https://docs.python.org/2/library/json.html – narko Nov 10 '16 at 14:54
I don't need the link. I'm just saying your answer could be better (as in example code along with the link) – OneCricketeer Nov 10 '16 at 14:55
Thanks for the reply. I can now confirm it is a JSON page. – Mr_Pi Nov 12 '16 at 16:26

OneCricketeer · Answer 2 · 2016-11-12T16:44:15.840

The response is text. It does contain JSON; you just need to extract it:

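import json
import requests

url = 'http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'

# Drop the "return_json(" prefix and the two trailing characters that close the call
strip_len = len("return_json(")

source = requests.get(url).text[strip_len:-2]
source = json.loads(source)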
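Fixed offsets break if the callback name changes or the response gains a trailing newline. A somewhat more tolerant sketch is below, written for Python 3. The helper name `parse_jsonp` is hypothetical, not part of any library. It also accepts either text or raw bytes, so it does not matter whether the body came from requests or from a urllib-style `read()`.

```python
import json
import re

# Matches "<callback>( ... )" with an optional trailing semicolon and whitespace.
_JSONP_RE = re.compile(r'^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(body, encoding="utf-8"):
    """Return the JSON value wrapped in a JSONP response body (hypothetical helper)."""
    # Raw bytes (e.g. from a urllib read()) are decoded to text first.
    if isinstance(body, bytes):
        body = body.decode(encoding)
    match = _JSONP_RE.match(body)
    if match is None:
        raise ValueError("response does not look like JSONP")
    # Group 2 is everything between the outermost parentheses.
    return json.loads(match.group(2))


print(parse_jsonp('return_json({"a": [1, 2]});'))    # {'a': [1, 2]}
print(parse_jsonp(b'return_json({"a": [1, 2]})\n'))  # {'a': [1, 2]}
```

With the real URL this would presumably be called as `parse_jsonp(requests.get(url).text)`. The result is ordinary dicts and lists; once you find which part of it holds the list of records, that list can be passed to `pandas.DataFrame`.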

edited Nov 12 '16 at 16:44

answered Nov 10 '16 at 14:43

OneCricketeer


Thanks for the reply. I tried this method before, but I didn't know I should strip the 'return_json('. One comment: the third line should be `source=requests.get(url).text[strip_len:-2]`, with -2 rather than -1. – Mr_Pi Nov 12 '16 at 16:27
I couldn't see the end of the response, but yes, you should strip that as it isn't part of the JSON – OneCricketeer Nov 12 '16 at 16:44
Basically, that URL is returning something that is meant to be queried by JavaScript, not Python. http://stackoverflow.com/a/7613857/2308683 – OneCricketeer Nov 12 '16 at 16:46
