
I am trying to pull the JSON from a urllib.request response for a Twitter page. I am doing this out of curiosity, and also because I am trying to determine what to request with Scrapy in order to write code that bypasses Twitter's infinite scrolling and lets me pull all the tweets off a user's timeline.

(I know there are some packages that already do this but I want to set it up by myself to learn by doing :) )

I have been using the urllib package to get the response data, but I run into a frustrating error when I attempt it:

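import json
import urllib.request  # "import urllib" alone does not guarantee the request submodule is loaded

# Fetch the profile page and decode the response body to text
with urllib.request.urlopen("https://twitter.com/vonkraush") as url:
    data = url.read().decode()

# This is the call that raises the JSONDecodeError shown below
print(json.loads(data))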

Traceback (most recent call last):

  File "<ipython-input-30-208336effb36>", line 1, in <module>
    json.loads(data)

  File "C:\Users\Josh\Anaconda3\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)

  File "C:\Users\Josh\Anaconda3\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())

  File "C:\Users\Josh\Anaconda3\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Expecting value

I've tried explicitly passing 'utf-8' into the decoding and a few other methods, but nothing has so far allowed me to get past this error. What am I doing wrong and how can I fix it?

Josh Kraushaar
  • 2
    `data` is an HTML document, not a JSON object. You cannot pass it to `json.loads()`. – DYZ Sep 10 '17 at 17:52
  • Odd, I have seen other people suggest this exact same code block on stack exchange: https://stackoverflow.com/questions/12965203/how-to-get-json-from-webpage-into-python-script What should I do instead then? – Josh Kraushaar Sep 10 '17 at 18:04
  • 1
    If there is any JSON in that page, you have to extract it from the page and then call `json.loads()`. – DYZ Sep 10 '17 at 18:06
  • Gotcha, how would you recommend extracting it then? – Josh Kraushaar Sep 10 '17 at 18:16
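
One way to do the extraction DYZ describes, sketched with only the standard library. It assumes the page embeds its data in `<script type="application/json">` tags, which is a common pattern but is not confirmed for Twitter's pages. The names `JsonScriptExtractor`, `extract_embedded_json` and `sample_html` are hypothetical and exist only for this illustration.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON script block opens
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so buffer it
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_embedded_json(html_text):
    """Return a list with one parsed object per embedded JSON block."""
    parser = JsonScriptExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.documents


# Made-up page used only to demonstrate the idea offline
sample_html = """
<html><body>
<script type="application/json">{"user": "example", "tweets": 2}</script>
<p>Ordinary markup that json.loads() cannot parse.</p>
</body></html>
"""
print(extract_embedded_json(sample_html))  # [{'user': 'example', 'tweets': 2}]
```

If the page keeps its data somewhere else (for example inside inline JavaScript), the same idea applies, but the extraction step has to be adapted to whatever the page actually contains.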

1 Answer

1

This approach cannot work: that URL always returns an HTML page, not JSON. To get user data from Twitter, use the Twitter Dev API.
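
A quick way to see this for yourself is to look at the `Content-Type` header before calling `json.loads()`. The sketch below is only an illustration: `parse_if_json` and `fetch_json` are hypothetical helper names, and it assumes the server labels JSON responses as `application/json`.

```python
import json
import urllib.request


def parse_if_json(body, content_type):
    """Parse body as JSON only when the Content-Type header says it is JSON."""
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json":
        raise ValueError(f"expected JSON but the server sent {media_type!r}")
    return json.loads(body)


def fetch_json(url):
    """Fetch url and parse it, failing with a clear message for non-JSON bodies."""
    with urllib.request.urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        body = response.read().decode("utf-8")
    return parse_if_json(body, content_type)


# Offline demonstration with made-up bodies and headers
print(parse_if_json('{"ok": true}', "application/json; charset=utf-8"))
try:
    parse_if_json("<!DOCTYPE html><html></html>", "text/html; charset=utf-8")
except ValueError as error:
    print(error)
```

With a check like this, an HTML response fails with a message naming the actual media type instead of the less obvious `JSONDecodeError: Expecting value`.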

See here; the Twitter Dev API might help you extract information from Twitter. To use it, you will have to authenticate yourself as a Twitter user. Make sure you create a Twitter app first and get your OAuth key, which is what gives you access to the Twitter API.

The Twitter API uses token-based authentication. The token you receive in response to the API call identifies you as a user.
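
As a rough sketch of what a token-authenticated request looks like with `urllib`, assuming the token is sent as a bearer token in the `Authorization` header: the URL below is a placeholder and not a real Twitter endpoint, `YOUR_TOKEN` stands in for the token you obtain, and `build_authenticated_request` is a hypothetical helper name. Check the API documentation for the actual endpoints and the authentication scheme they require.

```python
import urllib.request


def build_authenticated_request(url, token):
    """Build a request that carries the token in the Authorization header."""
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # assumed scheme: bearer token
            "Accept": "application/json",
        },
    )


# Placeholder URL and token, for illustration only
request = build_authenticated_request("https://api.example.com/timeline", "YOUR_TOKEN")
print(request.get_header("Authorization"))  # Bearer YOUR_TOKEN

# To actually send it (requires a real endpoint and a valid token):
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```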

Punit Grover
  • Twitter API only lets me pull a little over 3000 tweets, I am toying around with direct web scraping to see if I can pull all of a user's tweets. – Josh Kraushaar Sep 10 '17 at 18:43
  • This [Twitter Scraper](https://github.com/bpb27/twitter_scraping) might solve your problem. – Punit Grover Sep 10 '17 at 18:48
  • I am mostly doing this so I get a better handle on how web scraping with scrapy works, to learn by doing rather than a need to pull a bunch of tweets. They also use selenium rather than scrapy. – Josh Kraushaar Sep 10 '17 at 18:50
  • They are using selenium to get the tweet by automating your browser to get data from twitter using a CSS selector i.e. `li.js-stream-item`. It will only get tweets between two dates. – Punit Grover Sep 10 '17 at 19:07
  • 1
    If you are attempting to scrape the Twitter website you are violating Twitter's terms of service and developer agreement. Your IP address may be blocked by Twitter. – Andy Piper Sep 11 '17 at 08:52