I'm making a news aggregator using Python and Scrapy and cannot find an answer for exactly what I'm trying to do.

I am scraping a line of text from an article, a publish time, like so:

item['published'] = hxs.select('//div[@class="date"]/text()').extract()

This is what I'm getting back (there is no ISO date on this site, as there is on some of the others I'm scraping for this project):

Last Updated: Tuesday, March 11, 2014

I need to put these dates and times into a common format, one that other sources' publish times can also be converted to, so that I can order them chronologically later via that key in the JSON feed.

So with a date in that format, how can I convert it to a usable form? I'd like in the end to have all the ISO dates and those written-out text formats converted to something like this:

Published: 2:15 p.m., March 15, 2014.
Chris

3 Answers

I think you want to use dateutil.parser.parse. Here's the documentation. It handles a variety of formats. On Debian-style OSes, it's available in the package python-dateutil.
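A minimal sketch of how that could look for the string in the question, assuming python-dateutil is installed. The variable names and the step that splits off the "Last Updated:" label are illustrative assumptions, not part of the original answer:

```python
from dateutil import parser  # third-party package: python-dateutil

raw = "Last Updated: Tuesday, March 11, 2014"

# Assumption: the label always ends at the first colon, so drop it
# and hand only the date text to the parser.
date_text = raw.split(":", 1)[1].strip()

published = parser.parse(date_text)
print(published.isoformat())  # 2014-03-11T00:00:00

# The same call also accepts ISO-style strings from other sources.
other = parser.parse("2014-03-15T14:15:00")
print(other > published)  # True
```

Because both results are ordinary datetime objects, items from different sources can be compared and sorted directly.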

If this answer doesn't fully answer your question, please comment and I'll try to update it appropriately.

jrennie

Edit: jrennie's solution above is way cleaner than mine.

This works. I use strptime in order to get a solution. Note, since there is no hh:mm data in the original string, I can't output any hh:mm data like you did in your example.

Step by step solution:

>>> import time
>>> t = "Last Updated: Tuesday, March 11, 2014"
>>> t = t.rsplit(' ', 4)[1:5]  # Get a list of the relevant date fields
>>> t
['Tuesday,', 'March', '11,', '2014']
>>> t = ' '.join(t)  # Turn t back into a string so we can use strptime
>>> t
'Tuesday, March 11, 2014'
>>> t = time.strptime(t, "%A, %B %d, %Y")  # Parse the string into a struct_time
>>> t
time.struct_time(tm_year=2014, tm_mon=3, tm_mday=11, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=70, tm_isdst=-1)

One liner:

import time

t = "Last Updated: Tuesday, March 11, 2014"
time.strptime(' '.join(t.rsplit(' ',4)[1:5]), "%A, %B %d, %Y")

This results in a struct_time. You may end up wanting to convert these to datetimes, depending on how you wish to manipulate them.
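As a rough sketch of that conversion, the first six fields of a struct_time can be passed to the datetime constructor. The names sort_key and display, and the choice of an ISO string as the sortable key, are assumptions made for illustration:

```python
import time
from datetime import datetime

raw = "Last Updated: Tuesday, March 11, 2014"
struct = time.strptime(' '.join(raw.rsplit(' ', 4)[1:5]), "%A, %B %d, %Y")

# The first six fields are year, month, day, hour, minute, second.
published = datetime(*struct[:6])

# Assumed storage choice: an ISO 8601 string sorts chronologically as text.
sort_key = published.isoformat()

# Human-readable form for the feed; the source string has no time of day.
display = "Published: " + published.strftime("%B %d, %Y")

print(sort_key)  # 2014-03-11T00:00:00
print(display)   # Published: March 11, 2014
```

Keeping the ISO string as the key and building the display text separately means the ordering does not depend on how the date is shown.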

nfazzio

Today, a good way to do that is to use the dateparser project from the Scrapy team: https://github.com/scrapinghub/dateparser

Marius