I'm working on a Python web scraper to collect data for a project. I'm pointing it at Twitter's search page for now, because I found the Twitter API wouldn't return anything older than a week. The code I'm using is:
import urllib
import urllib.request
from bs4 import BeautifulSoup as soup
my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'
page_html = urllib.request.urlopen(my_url)
page_soup = soup(page_html, "html.parser")
print(page_soup.title.text)
for tweet in page_soup.findAll('p', {'class': 'TweetTextSize'}, lang='en'):
    print(tweet.text)
From my understanding, passing {'class': 'TweetTextSize'} as the attribute argument of findAll works like a LIKE match on the class attribute, and that seems to work okay. The specific part of the HTML I'm targeting with findAll is:
<p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0"></p>
I've looked through the other tweets and they all seem to use this class, but I can't figure out why the code only returns one tweet. Strangely, it isn't even the first tweet on the page; it's the second.
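To rule out the selector itself, here is a minimal sketch that runs the same kind of lookup against a static string instead of the live page. The markup is modelled on the snippet above, but the tweet text is made up and `tweet_texts` is a hypothetical helper name. As far as I can tell, BeautifulSoup treats `class` as a multi-valued attribute and matches when any one of the space-separated class names equals the string given.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the <p> snippet above; the tweet text is invented.
SAMPLE_HTML = """
<div>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0">first tweet</p>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0">second tweet</p>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="fr" data-aria-label-part="0">third tweet</p>
  <p class="other" lang="en">not a tweet</p>
</div>
"""


def tweet_texts(html, lang=None):
    """Return the text of every <p> whose class list contains TweetTextSize."""
    page = BeautifulSoup(html, "html.parser")
    # A plain string for "class" matches any single class name on the tag.
    attrs = {"class": "TweetTextSize"}
    if lang is not None:
        # Optionally require an exact match on the lang attribute as well.
        attrs["lang"] = lang
    return [p.get_text() for p in page.find_all("p", attrs=attrs)]


english = tweet_texts(SAMPLE_HTML, lang="en")
everything = tweet_texts(SAMPLE_HTML)
print(len(english), len(everything))
```

On this static sample every matching paragraph comes back, so if the live page yields only one tweet, the difference may lie in the HTML that urlopen actually receives rather than in the findAll call. That is a guess on my part, not something I've confirmed.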
If someone could point me in the right direction, that'd be great. Thanks.
PS: Is there also a way to grab ALL the tweets? Browsing through the HTML, I found an element with the class "stream-container" that has an attribute 'data-min-position', whose value changes whenever you scroll down and new tweets load. I suspect that even if my code worked, it would only see the initial page of results rather than everything the search returns. Thanks.
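For reference, this is roughly how I'd read that attribute from the page I already have. It's only a sketch: the `div` tag and the position value in the sample are placeholders I made up, `min_position` is a hypothetical helper, and I don't know how (or whether) the value can be used to request the next batch of tweets.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the tag name and the position value are placeholders,
# only the class name and attribute name come from the page I inspected.
SAMPLE_HTML = """
<div class="stream-container" data-min-position="EXAMPLE-CURSOR-123">
  <p class="TweetTextSize">a tweet</p>
</div>
"""


def min_position(html):
    """Return the data-min-position value of the stream container, or None."""
    page = BeautifulSoup(html, "html.parser")
    # Look the element up by class only, since the tag name is an assumption.
    container = page.find(class_="stream-container")
    if container is None:
        return None
    # .get() returns None instead of raising if the attribute is missing.
    return container.get("data-min-position")


print(min_position(SAMPLE_HTML))
print(min_position("<div></div>"))
```

This only reads the value present in the initial HTML; anything loaded later by scrolling would not be in the response that urlopen returns.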
Edit: I noticed my URL already includes lang=en, so the lang='en' filter in findAll is a little redundant, but it doesn't seem to affect the result at all.