Good morning! I have recently been trying to scrape the Imgur page of a hosted video, in particular the video length, as you can see in the image. I have tried every way I could think of (Python's Beautiful Soup, the requests library, etc.) to get this data, but every time I either receive an HTML file that has nothing to do with the data I need or a completely blank response. I can't use Selenium since the code needs to run on Heroku, so I have no idea how to do this. Thanks to all who spare some time helping me!
-
It's __Webscraping__ not webscrapping – DisappointedByUnaccountableMod Mar 15 '21 at 16:07
1 Answer
If you look at the network tab, you will see that imgur provides an API for each post. So, instead of scraping the website itself, you can query the API to get the desired output:
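import requests

# Replace <your_client_id> with your own Imgur API client ID.
url = 'https://api.imgur.com/post/v1/media/xsmX53B?client_id=<your_client_id>&include=media%2Cadconfig%2Caccount'
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors such as 429
response_json = response.json()  # parse the JSON body into a dict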
This will get you:
{"id":"xsmX53B","account_id":0,"title":"","description":"","view_count":642,"upvote_count":0,"downvote_count":0,"point_count":0,"image_count":1,"comment_count":0,"favorite_count":0,"virality":0,"score":0,"in_most_viral":false,"is_album":false,"is_mature":false,"cover_id":"xsmX53B","created_at":"2021-03-08T16:51:58Z","updated_at":null,"url":"https://imgur.com/xsmX53B","privacy":"","vote":null,"favorite":false,"is_ad":false,"ad_type":0,"ad_url":"","include_album_ads":false,"shared_with_community":false,"is_pending":false,"platform":"api","ad_config":{"show_ads":false,"safe_flags":["not_in_gallery","share"],"high_risk_flags":[],"unsafe_flags":["sixth_mod_unsafe"],"wall_unsafe_flags":[]},"media":[{"id":"xsmX53B","account_id":0,"mime_type":"video/mp4","type":"video","name":"","basename":"","url":"https://i.imgur.com/xsmX53B.mp4","ext":"mp4","width":960,"height":540,"size":14009178,"metadata":{"title":"","description":"","is_animated":true,"is_looping":true,"duration":59.92,"has_sound":false},"created_at":"2021-03-08T16:51:58Z","updated_at":null}],"display":[]}
Now you can read the duration from the JSON response easily:
duration = response_json['media'][0]['metadata']['duration']
Note that I hid my client_id; therefore, don't forget to change it to yours. Finally, duration will look like this:
59.92
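The direct lookup `['media'][0]['metadata']['duration']` raises an exception if a post has no media or an item carries no duration. As a rough sketch of a more defensive lookup, the snippet below parses a trimmed copy of the response shown above and collects every duration it finds. The helper name `extract_durations` is hypothetical, and the unit is assumed to be seconds; the JSON layout is assumed to match the sample.

```python
import json

# Trimmed copy of the response shown above; only the fields used here are kept.
SAMPLE_RESPONSE = '''
{"id": "xsmX53B",
 "media": [{"id": "xsmX53B", "type": "video",
            "metadata": {"is_animated": true, "duration": 59.92, "has_sound": false}}]}
'''


def extract_durations(post):
    """Return the duration of every media item that reports one (hypothetical helper)."""
    durations = []
    for item in post.get("media", []):
        # Missing "metadata" or "duration" keys are skipped instead of raising.
        duration = item.get("metadata", {}).get("duration")
        if duration is not None:
            durations.append(duration)
    return durations


post = json.loads(SAMPLE_RESPONSE)
print(extract_durations(post))  # prints [59.92]
```

With a live request, the same function could be applied to `response_json` in place of the parsed sample.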

Rustam Garayev
-
Thank you, really. I have spent hours and hours on this without finding anything. Just a question: do I really need a client ID? I tried to get the JSON without it and it was working... – Omni Master Mar 09 '21 at 09:47
-
I haven't checked specifically for imgur, but some APIs show results only for authenticated users, so it is better to include it. If you didn't face any problems, then it is fine – Rustam Garayev Mar 09 '21 at 09:52
-
Just tried it. If I use it in Python I get a 429 error saying I exceeded the limit, so I guess I really need to do it. Really, thanks! – Omni Master Mar 09 '21 at 09:55
-
Note that a 429 error is not an authentication error. It means you have sent too many requests to the website, so they blocked you for some time. Try not to spam it :) – Rustam Garayev Mar 09 '21 at 09:56
-
It's quite interesting: I can still use the search engine to get the data, but when I use Python I always get that error. – Omni Master Mar 09 '21 at 10:00
-
Check this thread for more information: https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python#:~:text=Receiving%20a%20status%20429%20is,not%20willing%20to%20accept%20this. – Rustam Garayev Mar 09 '21 at 10:02
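One common way to cope with the 429 responses discussed in these comments is to wait and retry with an increasing delay. The sketch below is only an illustration of that idea, not something imgur documents: `get_with_backoff` is a hypothetical helper, the default attempt count and delays are arbitrary, and whether imgur sends a `Retry-After` header at all is an assumption. The request itself is passed in as a callable so the retry logic stays independent of the HTTP library.

```python
import time


def get_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or attempts run out.

    fetch is any zero-argument callable returning an object with
    .status_code and .headers (for example a requests.Response).
    The last response is returned even if it is still a 429.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Use the server's Retry-After hint when it is a plain number of
        # seconds; otherwise fall back to exponential backoff (1s, 2s, 4s, ...).
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
    return response
```

With the requests library from the answer, a call could look like `get_with_backoff(lambda: requests.get(url, timeout=10))`. Retrying only helps with short-lived limits; if the block lasts longer, reducing the request rate is the real fix.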