
I'm trying to scrape Twitter search results with Python 3.x, but I only ever collect the last 20 tweets of my request. I would like to collect all the data for a query between 2006 and now. For this, I think I need to create two more functions: one to collect the oldest tweets and one to collect the current ones. How can I collect the data from this infinite-scroll page? I think I have to use the tweet IDs, but no matter what request I make, it's always the same last 20 tweets that I get.

from pprint import pprint
from lxml import html
import requests
import datetime as dt
from bs4 import BeautifulSoup  # Python 3: BeautifulSoup 4 lives in the bs4 package

def search_twitter(search):
    url = "https://twitter.com/search?f=tweets&vertical=default&q="+search+"&src=typd&lang=fr"
    request = requests.get(url)
    sourceCode = BeautifulSoup(request.content, "lxml")
    tweets = sourceCode.find_all('li', 'js-stream-item')
    return tweets

def filter_tweets(tweets):
    data = []
    for tweet in tweets:
        if tweet.find('p', 'tweet-text'):
            dtwee = [
                ['id', tweet['data-item-id']],
                ['username', tweet.find('span', 'username').text],
                ['time', tweet.find('a', 'tweet-timestamp')['title']],
                ['tweet', tweet.find('p', 'tweet-text').text]]  # .text is already str in Python 3
            data.append(dtwee)
            #tweet_time = dt.datetime.strptime(tweet_time, '%H:%M - %d %B %Y')
        else:
            continue
    return data

def firstlastId_tweets(tweets):
    # The first element holds the newest tweet's ID, the last the oldest's
    firstID = tweets[0][0][1] if tweets else ""
    lastID = tweets[-1][0][1] if tweets else ""
    return firstID, lastID

def last_tweets(search, lastID):
    url = "https://twitter.com/search?f=tweets&vertical=default&q="+search+"&src=typd&lang=fr&max_position=TWEET-"+lastID
    request = requests.get(url)
    sourceCode = BeautifulSoup(request.content, "lxml")
    tweets = sourceCode.find_all('li', 'js-stream-item')
    return tweets

tweets = search_twitter("lol")
tweets = filter_tweets(tweets)
pprint(tweets)
firstID, lastID = firstlastId_tweets(tweets)
print(firstID, lastID)
while True:
    lastTweets = last_tweets("lol", lastID)
    pprint(lastTweets)
    firstID, lastID = firstlastId_tweets(lastTweets)
    print(firstID, lastID)

1 Answer


I found a good solution based on this webpage:

http://ataspinar.com/2015/11/09/collecting-data-from-twitter/

What I did was create a variable called max_pos holding this string:

    '&max_position=TWEET-' + last_id + '-' + first_id

where first_id is the tweet ID at position 1 and last_id is the tweet ID at position 20.

For the request, I then used something like this, starting with max_pos empty:

    request = requests.get(url + max_pos)

Since this seems to be a common issue, we could post a working solution. I still don't have it showing the results the way I need, but I could simulate "scrolling down to the end" by following the guide from the link.
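As a sketch of the scrolling loop described above: `build_max_pos` and `scroll_search` are names I made up, and the old `twitter.com/search` `max_position` parameter this mimics is long gone, so treat this as an illustration of the pagination logic only. The `fetch` callable (here a stub) stands in for `requests.get` plus the question's `filter_tweets` parsing.

```python
def build_max_pos(first_id, last_id):
    # Empty on the first request; afterwards mimics the infinite-scroll
    # parameter "TWEET-<oldest id>-<newest id>" from the old search page.
    if not first_id or not last_id:
        return ""
    return "&max_position=TWEET-" + last_id + "-" + first_id

def scroll_search(search, fetch, max_pages=10):
    # `fetch` takes a URL and returns parsed tweets in the same shape as
    # filter_tweets' output; injecting it keeps the scrolling logic
    # testable without hitting the network.
    base = ("https://twitter.com/search?f=tweets&vertical=default"
            "&q=" + search + "&src=typd&lang=fr")
    max_pos = ""
    collected = []
    for _ in range(max_pages):
        tweets = fetch(base + max_pos)
        if not tweets:
            break  # no more results: we reached the end of the stream
        collected.extend(tweets)
        first_id = tweets[0][0][1]   # newest tweet on this page
        last_id = tweets[-1][0][1]   # oldest tweet on this page
        max_pos = build_max_pos(first_id, last_id)
    return collected

# Example with a fake fetch that serves two pages and then runs dry:
pages = [[[['id', '300']], [['id', '201']]],
         [[['id', '200']], [['id', '101']]]]
fake_fetch = lambda url: pages.pop(0) if pages else []
print(len(scroll_search("lol", fake_fetch)))  # 4 tweets collected
```

The key point is that the loop terminates when a page comes back empty, instead of the question's `while True`, which re-requests the same 20 tweets forever.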
