0

I have a list of around 2000 blogs in different languages and with different layouts. I have two tasks: identify dead links, and identify blogs that haven't been updated for more than 90 days. While the first task is easy, the second one is giving me a headache.

Examples:

http://100mirror.com/

https://www.adamsmith.org/blog

http://allfinancialmatters.com/ (this one hasn't been updated for more than 90 days)

I have tried:

  • extracting the year with a regex, together with the 10 characters before and the 10 after it, and trying to parse the result with dateparser - this doesn't really work
  • using javascript:alert(document.lastModified) - this doesn't work for dynamically generated sites
  • using the Wayback Machine - far too inaccurate

Does anybody have another idea how to approach this task?

I am using Python.

Dragonthoughts
pawelty

2 Answers

1

First, check whether the current year (e.g. 2018) appears anywhere in the HTML:

import re
years = re.findall(r'.*2018.*', res.text)  # res is assumed to be a requests response; one match per line containing the year

Iterate over each match and check whether it contains a month from the past three months (4, 5, 6, Mar, Apr, May, Jun). If it does, treat the blog as updated within the last 90 days; otherwise treat it as not updated.

recent = [line for line in years if re.search(r'Mar|Apr|May|Jun', line)]  # non-empty means a recent month was found
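
The same check can be sketched without hard-coding the year and months. This is only a rough illustration: the function names are made up, it assumes English month names (full or three-letter), and it is coarse, since any date in a month touched by the 90-day window counts as recent.

```python
import re
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def recent_year_months(today, days=90):
    """Return the set of (year, month) pairs touched by the last `days` days."""
    pairs = set()
    for n in range(days + 1):
        d = today - timedelta(days=n)
        pairs.add((d.year, d.month))
    return pairs

def looks_recent(html, today, days=90):
    """True if some line mentions a recent year together with a matching month name."""
    pairs = recent_year_months(today, days)
    for line in html.splitlines():
        for year, month in pairs:
            name = MONTHS[month - 1]
            # Match either the full name or its three-letter abbreviation.
            month_re = rf"\b(?:{name}|{name[:3]})\b"
            if re.search(rf"\b{year}\b", line) and re.search(month_re, line):
                return True
    return False
```

For example, looks_recent(res.text, date.today()) would be the call for a downloaded page, with res again assumed to be a requests response. Numeric months (4, 5, 6) are not handled here.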
chirag sanghvi
0

Regardless of the format of a blog or its language, it's safe to assume that the date format stays the same throughout that blog. I would build different regex searches for all the date formats I can think of:

  1. dd/mm/yy
  2. dd-mm-yy
  3. Month dd, yyyy
  4. yyyy.mm.dd

And so on... and look for all of them. Once one of them matches on a page, take the latest date found on the main page; that will usually represent the last time the blog was updated.
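
A minimal sketch of that idea, assuming the page text has already been downloaded. The names DATE_PATTERNS, latest_date and is_stale are made up for illustration, and only the four formats listed above are covered. Month names are parsed with strptime, so English names and the default locale are assumed. Dates later than today are ignored, since they cannot be a last-update date.

```python
import re
from datetime import date, datetime

# (regex, strptime format) pairs for the four formats listed above.
DATE_PATTERNS = [
    (r"\b\d{1,2}/\d{1,2}/\d{2}\b", "%d/%m/%y"),         # dd/mm/yy
    (r"\b\d{1,2}-\d{1,2}-\d{2}\b", "%d-%m-%y"),         # dd-mm-yy
    (r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b", "%B %d, %Y"),   # Month dd, yyyy
    (r"\b\d{4}\.\d{1,2}\.\d{1,2}\b", "%Y.%m.%d"),       # yyyy.mm.dd
]

def latest_date(text, today):
    """Return the newest parseable date in `text` that is not in the future, or None."""
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for raw in re.findall(pattern, text):
            try:
                parsed = datetime.strptime(raw, fmt).date()
            except ValueError:
                continue  # looked like a date but is not one, e.g. 31/02/18
            if parsed <= today:
                found.append(parsed)
    return max(found, default=None)

def is_stale(text, today, max_age_days=90):
    """True/False if a date was found, None if no known format matched."""
    newest = latest_date(text, today)
    if newest is None:
        return None
    return (today - newest).days > max_age_days
```

A None result is the signal to inspect that site by hand and add its format to the list. Note that dd/mm/yy and mm/dd/yy cannot be told apart by a regex alone, so the day-first reading here is an assumption.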

If, for a specific site, none of the formats you came up with matches at all, look up which format that site uses, add another regex for it, and repeat.

You can also search just for the numbers 2018 or 18. If they are nowhere to be found, the site was probably last updated in 2017 or earlier (that is only true right now, of course, and the logic will fail at the very start of 2019, and so on).
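
One way to keep this shortcut from breaking at the turn of the year is to derive the acceptable years from today's date instead of hard-coding 2018. A rough sketch, with a made-up function name; it only looks for four-digit years, because a bare "18" matches far too many unrelated numbers.

```python
import re
from datetime import date, timedelta

def mentions_recent_year(text, today, days=90):
    """True if `text` contains a four-digit year that the last `days` days fall in."""
    # Early in January the window reaches back into the previous year,
    # so that year is accepted as well.
    years = {today.year, (today - timedelta(days=days)).year}
    return any(re.search(rf"\b{year}\b", text) for year in years)
```

A False result suggests the blog is stale; a True result proves nothing on its own (it may be a copyright footer), so it only works as a cheap first filter.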

Sorry for not bothering with example code, but you didn't either :) This is just the basic algorithm that I would start with and improve. You can check out date regex examples here: Regular Expression to match valid dates

You might also use some of the answers from here: Check if string has date, any format

Ofer Sadan