0

I am trying to get the image links from Twitter posts. When I tried to do this with the Twitter API, I only get the link of images if they are posted as images but most newspapers post a link to their whole report including an image. I am trying to get the image inside that link. This is normally the basic way to get the link example. However, in my case, an image can be hidden under a link like this example from twitter. I have tried solutions like beautifulsoup but the HTML response I get doesn't make any sense. It doesn't include any of the attributes or classes I am searching for. This is the code I tried:

    url = "https://twitter.com/derspiegel/status/1511961740694614016"
    res = requests.get(url)
    
    soup = BeautifulSoup(res.content, "html.parser")
    print(soup)
    comicElem = soup.find_all(class_="css-9pa8cd")
    print(comicElem)

Does anyone have an idea to get this done with Twitter API or any other method?

  • Did you try with the twitter API or logging in? The scrambled HTML response is because Twitter renders the page using Javascript, you can test it by executing: `print(soup.find('p').text)` right after soup – Fabio Craig Wimmer Florey Apr 11 '22 at 00:55
  • I get this "We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center." How to fix this? –  Apr 11 '22 at 01:00
  • You really can't circumvent it with the requests module (https://github.com/gottfrois/link_thumbnailer/issues/141) but here's a solution (partial) because twitter is against unauthorized bots: https://stackoverflow.com/questions/71444292/how-to-fix-javascript-is-not-available-output-when-web-scraping-dynamic-sites-w as for your question, it's a duplicate now :) – Fabio Craig Wimmer Florey Apr 11 '22 at 01:03
  • Thank you for your response. Do you know a way to do this with Twitter API? or it is impossible? –  Apr 11 '22 at 01:06
  • you may have the most common problem: page uses `JavaScript` to add/update elements but `BeautifulSoup`/`lxml`, `requests`/`urllib` can't run `JS`. You may need [Selenium](https://selenium-python.readthedocs.io/) to control real web browser which can run `JS`. OR use (manually) `DevTools` in `Firefox`/`Chrome` (tab `Network`) to see if `JavaScript` reads data from some URL. And try to use this URL with `requests`. `JS` usually gets `JSON` which can be easy converted to Python dictionary (without `BS`). You can also check if page has (free) `API` for programmers. – furas Apr 11 '22 at 01:38

1 Answers1

-1

Use selenium that's example cope tutorials website :

from selenium import webdriver
#set chromedriver.exe path
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get("https://www.tutorialspoint.com/index.htm");
#open file in write and binary mode
with open('Logo.png', 'wb') as file:
#identify image to be captured
   l = driver.find_element_by_xpath('//*[@alt="Tutorialspoint"]')
#write file
   file.write(l.screenshot_as_png)
#close browser
driver.quit()

Copying image by id Link to tutorial

None
  • 41
  • 7