I have a list of around 2000 blogs in different languages and with different layouts. I have two tasks: identify dead links, and identify blogs that haven't been updated for more than 90 days. While the first task is easy, the second one is giving me a headache.
Examples:
https://www.adamsmith.org/blog
http://allfinancialmatters.com/ (this one hasn't been updated for more than 90 days)
I have tried:
- extract the year with a regex, together with the 10 characters before and 10 after it, and try to parse that with dateparser - doesn't really work
- use javascript:alert(document.lastModified) - it doesn't work for dynamically generated sites
- use the Wayback Machine - far too inaccurate
Does anybody have another idea how to approach this task?
I am using Python.
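
To make the goal concrete, here is a minimal sketch of the 90-day check itself, assuming a date can be found at all. It only looks for ISO-style dates (`YYYY-MM-DD`, e.g. in `<time datetime="...">` attributes) and uses the standard library's `datetime` rather than dateparser, so it is a simplification of the regex attempt above, not a general solution. The function names are hypothetical; `parse_last_modified` reads an HTTP `Last-Modified` header value, which is the same information `document.lastModified` relies on.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical helpers for illustration; not part of any existing library.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def newest_date_in_html(html, now):
    """Return the most recent ISO-style date in html that is not in the future, or None."""
    found = []
    for year, month, day in ISO_DATE.findall(html):
        try:
            candidate = datetime(int(year), int(month), int(day), tzinfo=timezone.utc)
        except ValueError:
            continue  # looked like a date but is not one, e.g. 2024-13-40
        if candidate <= now:
            found.append(candidate)
    return max(found, default=None)


def parse_last_modified(header_value):
    """Parse an HTTP Last-Modified header value into a datetime."""
    return parsedate_to_datetime(header_value)


def is_stale(last_update, now, max_age_days=90):
    """True if last_update is more than max_age_days before now."""
    return now - last_update > timedelta(days=max_age_days)
```

With the page source in `html`, the check would be `is_stale(newest_date_in_html(html, now), now)` after handling the `None` case. The hard part, which this sketch does not solve, is finding a reliable date on pages that only show dates in free text or in other languages.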