Good morning! I have recently been trying to scrape the imgur page of a hosted video, in particular the video length, as you can see in the image. I have tried every way I could think of (Python's Beautiful Soup, the requests library, etc.) to get this data, but every time I either receive an HTML file that has nothing to do with the data I need or a completely blank response. I can't use Selenium, since the code needs to run on Heroku, so I have no idea how to do this. Thanks to all who spare some time to help me!

1 Answer

If you look at the network tab of your browser's developer tools, you will see that imgur exposes an API endpoint for each post. So, instead of scraping the website itself, you can query the API to get the desired output:

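import requests

# Replace <your_client_id> with your own imgur API client ID.
url = 'https://api.imgur.com/post/v1/media/xsmX53B?client_id=<your_client_id>&include=media%2Cadconfig%2Caccount'
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors such as 429
response_json = response.json()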

This will get you:

{"id":"xsmX53B","account_id":0,"title":"","description":"","view_count":642,"upvote_count":0,"downvote_count":0,"point_count":0,"image_count":1,"comment_count":0,"favorite_count":0,"virality":0,"score":0,"in_most_viral":false,"is_album":false,"is_mature":false,"cover_id":"xsmX53B","created_at":"2021-03-08T16:51:58Z","updated_at":null,"url":"https://imgur.com/xsmX53B","privacy":"","vote":null,"favorite":false,"is_ad":false,"ad_type":0,"ad_url":"","include_album_ads":false,"shared_with_community":false,"is_pending":false,"platform":"api","ad_config":{"show_ads":false,"safe_flags":["not_in_gallery","share"],"high_risk_flags":[],"unsafe_flags":["sixth_mod_unsafe"],"wall_unsafe_flags":[]},"media":[{"id":"xsmX53B","account_id":0,"mime_type":"video/mp4","type":"video","name":"","basename":"","url":"https://i.imgur.com/xsmX53B.mp4","ext":"mp4","width":960,"height":540,"size":14009178,"metadata":{"title":"","description":"","is_animated":true,"is_looping":true,"duration":59.92,"has_sound":false},"created_at":"2021-03-08T16:51:58Z","updated_at":null}],"display":[]}

Now you can easily read the duration from the parsed JSON:

duration = response_json['media'][0]['metadata']['duration']

Note that I hid my client_id, so don't forget to replace it with yours. Finally, duration will look like this:

59.92
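
The direct lookup above raises a KeyError or IndexError if the post has no media or a field is missing. Here is a minimal sketch of a more defensive version, assuming the response keeps the shape shown above. The names `extract_duration` and `sample` are only illustrative and are not part of the imgur API:

```python
from typing import Any, Mapping, Optional


def extract_duration(post: Mapping[str, Any]) -> Optional[float]:
    """Return the duration (seconds) of the first video item that reports one, else None."""
    for item in post.get("media") or []:
        if item.get("type") != "video":
            continue  # skip images and other non-video media
        duration = (item.get("metadata") or {}).get("duration")
        if duration is not None:
            return float(duration)
    return None


# Trimmed copy of the response shown above (illustrative only).
sample = {
    "id": "xsmX53B",
    "media": [
        {"type": "video", "metadata": {"duration": 59.92, "has_sound": False}}
    ],
}
print(extract_duration(sample))
```

With a real response you would call `extract_duration(response_json)` and treat a `None` result as "no video duration available".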
Rustam Garayev
  • Thank you, really. I spent hours and hours on this without finding anything. Just a question: do I really need a client ID? I tried to get the JSON without it and it was working... – Omni Master Mar 09 '21 at 09:47
  • I haven't checked specifically for imgur, but some APIs show results only to authenticated users, so it is better to include it. If you didn't face any problems, then it is fine – Rustam Garayev Mar 09 '21 at 09:52
  • 1
    Just tried it. If I use it in python I get a 429 error saying I exceed the limit so I guess I really need do it. Really thanks! – Omni Master Mar 09 '21 at 09:55
  • Note that a 429 error is not an authentication error. It means you have sent too many requests to the website, so it blocked you for some time. Try not to spam it :) – Rustam Garayev Mar 09 '21 at 09:56
  • It's quite interesting: I can still use the search engine to get the data, but when I use Python I always get that error. – Omni Master Mar 09 '21 at 10:00
  • check this thread for more information: https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python#:~:text=Receiving%20a%20status%20429%20is,not%20willing%20to%20accept%20this. – Rustam Garayev Mar 09 '21 at 10:02
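
One common way to cope with the 429 responses discussed in the comments is to wait and retry with an increasing delay. The sketch below is only an illustration under stated assumptions: `fetch_with_backoff` is a hypothetical helper, `fetch` is any zero-argument callable returning an object with `status_code` and `headers` (as `requests` responses have), and whether imgur actually sends a `Retry-After` header is not confirmed here, so the code falls back to exponential delays when it is absent or not a plain number of seconds.

```python
import time
from typing import Any, Callable


def fetch_with_backoff(
    fetch: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() until it returns a non-429 response or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts; hand the last 429 back to the caller
        # Honor Retry-After when it is a whole number of seconds (assumption:
        # the server may not send it), otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return response


# Possible usage with requests (not executed here):
# response = fetch_with_backoff(lambda: requests.get(url, timeout=10))
```

The `sleep` parameter is injectable so the waiting logic can be exercised without real delays. Retrying does not lift a rate limit, so keeping the request rate low remains the main fix.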