
I'm working on a Python web scraper to grab information for a project I'm doing. I'm using it on Twitter at the moment, as I found the Twitter API wouldn't return anything older than a week. The code I'm using is:

import urllib
import urllib.request
from bs4 import BeautifulSoup as soup

my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

page_html = urllib.request.urlopen(my_url)
page_soup = soup(page_html, "html.parser")

print(page_soup.title.text)

for tweet in page_soup.findAll('p', {'class': 'TweetTextSize'}, lang='en'):
    print(tweet.text)

From my understanding, the attribute dictionary passed to findAll acts like a LIKE match on the class, and that seems to work okay. The specific part of the HTML I'm targeting with findAll is:

<p class="TweetTextSize  js-tweet-text tweet-text" lang="en" data-aria-label-part="0"></p>

Now I've looked through the other tweets and they all seem to use this class, yet I cannot figure out why it only returns one tweet. The strange thing is that it's not even the first tweet (it's the second).
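For what it's worth, BeautifulSoup treats class as a multi-valued attribute, so {'class': 'TweetTextSize'} matches any <p> whose class list contains that exact token (no substring/LIKE behaviour is involved). A minimal self-contained check with placeholder markup confirms the selector itself isn't the problem:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the markup in question:
html = '''
<div>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en">first tweet</p>
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en">second tweet</p>
  <p class="other" lang="en">not a tweet</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# class is multi-valued: 'TweetTextSize' matches whenever that token
# appears anywhere in the element's class list.
tweets = soup.find_all('p', {'class': 'TweetTextSize'}, lang='en')
print([t.text for t in tweets])
```

If this prints both placeholder tweets locally, the selector is fine and the fetched HTML simply doesn't contain the other tweets.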

If someone could point me in the right direction that'd be great. Thanks.

PS: I'd also like to ask if there's a way to grab ALL the tweets. When browsing the HTML, I found a class called "stream-container" with a 'data-min-position' attribute that changes whenever you scroll down and load new tweets. I suspect that even if my code did work, it would only see the initial page of results rather than ALL of them. Thanks.

Edit: I noticed my code passes lang='en' even though the URL already contains lang=en, so it's a little redundant, but it doesn't seem to affect anything.

2 Answers


Try this:

  my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

  page_html = urllib.urlopen(my_url).read()

It should work. With Python 3 you can do this:

import urllib.request

with urllib.request.urlopen(my_url) as f:
    page_html = f.read()
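If the plain request still comes back truncated, one thing worth trying (an assumption on my part, not a confirmed fix for Twitter) is sending a browser-like User-Agent, since some sites vary what they serve based on the client:

```python
import urllib.request

my_url = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

# The User-Agent string below is a hypothetical browser-like value.
req = urllib.request.Request(
    my_url,
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'},
)
# Then fetch as before (network call, so commented out here):
# with urllib.request.urlopen(req) as f:
#     page_html = f.read()
```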
Mekicha
  • Still only giving me that 1 tweet. Didn't really work with the ` in the url so I just removed the quotes around 'en'. – Amazed Bystander Aug 04 '17 at 10:18
  • Are you adding the `.read()` method? And `print` your `page_html` again to see if anything changed. – Mekicha Aug 04 '17 at 10:39
  • Yup, I tried once more just to make sure. The .read() method doesn't seem to work without .request before urlopen, so I kept it in there. page_html still only came up with just those two, so I'm not sure what's going on. Still haven't figured out why this is happening (maybe it's reading off a cached version of the site?), but I suppose for selenium it opens the page in a browser, which helps. Thanks for all the help! – Amazed Bystander Aug 04 '17 at 11:02
  • One more question though. Are you using python2 or python3? – Mekicha Aug 04 '17 at 11:12
  • Ah, it was python3 – Amazed Bystander Aug 04 '17 at 12:02

Thanks for all the help. I still haven't figured out why my urllib request was returning an incomplete version of the page HTML, but I've found a workaround using selenium, as @ksai suggested.

Here's what it looks like:

Web Scraper

import urllib
import urllib.request
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time

myurl = 'https://twitter.com/search?q=australian%20megafauna&src=typd&lang=en'

driver = webdriver.Firefox()
driver.get(myurl)
# scroll automation using selenium: keep scrolling until the page height
# stops changing, i.e. no more tweets are being loaded
lenOfPage = driver.execute_script(
    "window.scrollTo(0, document.body.scrollHeight);"
    "var lenOfPage=document.body.scrollHeight;"
    "return lenOfPage;")
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(3)
    lenOfPage = driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
        "var lenOfPage=document.body.scrollHeight;"
        "return lenOfPage;")
    if lastCount == lenOfPage:
        match = True

page_html = driver.page_source
page_soup = soup(page_html, "html.parser")


print(page_soup.title.text)
for tweet in page_soup.findAll('p', {'class': 'tweet-text'}, lang='en'):
    print(tweet.text)

I had absolutely no idea how selenium worked, so I just adapted someone else's scrolling solution: How to scroll to the end of the page using selenium in python

@ksai, would there have been an alternate way you would've done it?

I'm planning to just store the tweets as text in a CSV file; is there a particular format you'd use if you were planning to train a bot on it?
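For the simple one-tweet-per-row case, here's a minimal sketch with the stdlib csv module (the tweet texts below are placeholders standing in for the page_soup results):

```python
import csv

# Hypothetical scraped tweets (stand-ins for the findAll results):
tweets = ["first tweet text", "second tweet text"]

with open('tweets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['tweet'])      # header row
    for text in tweets:
        writer.writerow([text])     # one tweet per row; csv handles quoting
```

The csv module takes care of quoting commas and newlines inside a tweet, so each tweet can safely stay a single field.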

Thanks

  • For dynamic websites like facebook, twitter, linkedin etc, I recommend using `selenium` – ksai Aug 04 '17 at 10:53
  • `csv` format is best for data analysis and model-training purposes. – ksai Aug 04 '17 at 10:57
  • Tbh I didn't even know about selenium, although I suppose all the guides I'd been using were using static pages as examples. So thanks for letting me know about it! – Amazed Bystander Aug 04 '17 at 11:04
  • Is it alright to just keep the tweets as a single line of text? Or do you think it would be better to separate them by word and have each sentence in a row? – Amazed Bystander Aug 04 '17 at 11:05
  • I am not an expert in this field. You might want to google for more information on this. – ksai Aug 04 '17 at 16:51