I am scraping the publish dates of articles from a number of publishers' websites using a Python script. The date is found in HTML tags or attributes identified variously by "time", "timestamp", and "published_date", among others, and appears in formats such as the following:

<time class="timestamp article__timestamp flexbox__flex--1"> Updated Aug. 18, 2021 3:54 pm ET </time>

<time class="css-x7rtpa e16638kd0" datetime="2021-08-18T19:10:54-04:00">Aug. 18, 2021</time>

<time datetime="2021-08-18T15:45:33-04:00"><span class="date">August 18, 2021</span><span class="time">3:45 PM ET</span></time>

<div class="timestamp"><span aria-label="Published on August 19, 2021 12:36 AM ET" class="timestamp__date--published"><span aria-hidden="true">08/19/2021 12:36 am ET</span></span></div>

<div class="article-date"><strong>Published</strong> <time> 8 hours ago</time></div>

'published_time': '2021-08-18T05:33:59Z'

This is what the text of those dates typically looks like after I extract it from the HTML:

Aug. 18, 2021 6:56 am ET

Aug. 18, 2021

Updated Aug. 18, 2021 3:54 pm ET

Published 6 hours ago

2021-08-18T08:00:00Z

I plan to scrape additional publishers' sites in the future, so before I write my own script, I'm curious whether there's an existing library or framework that normalizes these formats.

The tags and the resulting text above aren't listed in a 1:1 relationship, because there is enough variation that the exact pairing is largely irrelevant to any solution short of writing my own script. The solutions I've found so far cover unifying dates in JavaScript, but not extracting them from HTML tags.

These dates will ultimately be consumed by a server app written in Swift.

Pigpocket
  • datetime.strptime will be able to handle some of these, but not all. There's no generic solution. You're going to have to write code to identify what is in the string(s) and process it appropriately. –  Aug 19 '21 at 06:24
  • Maybe: https://stackoverflow.com/a/35069076/9192284 – MDR Aug 20 '21 at 21:44

1 Answer

The dateparser Python library looks like the best solution for my needs. Its documented features include:

  • Support for almost every existing date format: absolute dates, relative dates ("two weeks ago" or "tomorrow"), timestamps, etc.
  • Support for more than 200 language locales.
  • Language autodetection.
  • Customizable behavior through settings.
  • Support for non-Gregorian calendar systems.
  • Support for dates with timezone abbreviations or UTC offsets ("August 14, 2015 EST", "21 July 2013 10:15 pm +0500"...)
  • Search dates in longer texts.
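
As a rough sketch of how this could fit the scraping case in the question, the function below tries the standard library first for machine-readable values (such as the `datetime="..."` attribute or `published_time`), and only falls back to `dateparser.parse` for human-readable or relative text. The function name `normalize_publish_date`, the list of label prefixes, and the choice to treat timezone-less results as UTC are my own assumptions for illustration, not part of dateparser. Whether a given publisher's string parses (for example, one ending in "ET") should be checked case by case.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical labels that publishers put in front of the date text.
LABEL_PREFIXES = ("Updated", "Published")


def normalize_publish_date(raw: str) -> Optional[str]:
    """Return the date as an ISO 8601 string in UTC, or None if unparseable."""
    text = raw.strip()
    for label in LABEL_PREFIXES:
        if text.startswith(label):
            text = text[len(label):].strip()

    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    iso_candidate = text[:-1] + "+00:00" if text.endswith("Z") else text

    try:
        # Fast path: values that are already ISO 8601.
        parsed = datetime.fromisoformat(iso_candidate)
    except ValueError:
        # Fallback: free-form or relative text ("6 hours ago").
        import dateparser  # third-party: pip install dateparser

        parsed = dateparser.parse(
            text,
            settings={"TIMEZONE": "UTC", "RETURN_AS_TIMEZONE_AWARE": True},
        )
        if parsed is None:
            return None

    if parsed.tzinfo is None:
        # Assumption: treat timezone-less results as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


if __name__ == "__main__":
    for sample in ("2021-08-18T19:10:54-04:00", "2021-08-18T05:33:59Z"):
        print(sample, "->", normalize_publish_date(sample))
```

Emitting a single ISO 8601 UTC string keeps the consuming side simple, since the Swift server app then only has to handle one format.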

And for long-term stability, always a consideration for production deployments:

  • Actively supported
  • Over 7.4k users
  • 90+ contributors
Pigpocket