-2

How can I use a regular expression in Python to extract the date from an HTML <div> tag? The HTML looks something like this:

<div><strong>Date:</strong> Monday April 6, 2015 at 4:41PM <div>

I need to get the date in "yyyy-mm-dd hh:mm" format. The output for this should be "2015-04-06 16:41".

alecxe
Aron

2 Answers

2

Instead of approaching the problem with regular expressions (see RegEx match open tags except XHTML self-contained tags), I would use an HTML parser, BeautifulSoup, to get the text of the element and dateutil to parse the date out of it. After parsing the date, use strftime() to dump it into a string in the desired format:

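>>> from bs4 import BeautifulSoup
>>> from dateutil.parser import parse
>>> s = "<div><strong>Date:</strong> Monday April 6, 2015 at 4:41PM <div>"
>>> soup = BeautifulSoup(s, "html.parser")
>>> text = soup.find('div').text
>>> # fuzzy=True lets the parser skip words it does not recognize, such as "Date:"
>>> parse(text, fuzzy=True).strftime("%Y-%m-%d %H:%M")
'2015-04-06 16:41'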
alecxe
  • Isn't `lxml` more suitable for real world use cases? – hek2mgl Apr 07 '15 at 19:00
  • @hek2mgl it is opinion-based I'd say. There are different packages for the same task. Though I haven't seen a library so easy-to-use and natural as `BeautifulSoup`. – alecxe Apr 07 '15 at 19:02
  • (I'm new to Python) Isn't access to elements through XPath more straightforward and maintainable if the HTML changes than stepping through each node until the target? I mean BeautifulSoup can use lxml as the low-level parser but does not support XPath. I can't see why. – hek2mgl Apr 07 '15 at 19:05
  • 1
    @hek2mgl well, these are just different mechanisms to accomplish similar goals. E.g. what would be more readable for you: `//td[@class="test"]/following-sibling::div`, or `soup.find('td', class_='test').find_next_sibling('div')`? Plus, `BeautifulSoup` supports CSS selectors, which you can see as an alternative to XPath for locating elements. – alecxe Apr 07 '15 at 20:12
0

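This is not possible with a regular expression alone: the output needs the month as a number (04), but the source only contains its name (April), and a regex can only match text that is actually in the source.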
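A regex can still pull out the individual pieces if something else then converts the month name. Below is a minimal sketch of that idea, assuming the markup always follows the exact layout shown in the question and that the month names are English (`%B` and `%p` are locale-dependent). The names `pattern` and `extract_date` are made up for illustration, and the output format follows the example output given in the question.

```python
import re
from datetime import datetime

html = "<div><strong>Date:</strong> Monday April 6, 2015 at 4:41PM <div>"

# Capture month name, day, year and 12-hour time.
# Assumption: the text always looks like "Date:</strong> <weekday> <month> <day>, <year> at <time>".
pattern = re.compile(
    r"Date:</strong>\s*\w+\s+(\w+)\s+(\d{1,2}),\s+(\d{4})\s+at\s+(\d{1,2}:\d{2}[AP]M)"
)


def extract_date(fragment):
    """Return the date as 'YYYY-MM-DD HH:MM', or None if the pattern does not match."""
    match = pattern.search(fragment)
    if match is None:
        return None
    month, day, year, clock = match.groups()
    # strptime does the part the regex cannot: month name -> number, 12h -> 24h.
    parsed = datetime.strptime(f"{month} {day} {year} {clock}", "%B %d %Y %I:%M%p")
    return parsed.strftime("%Y-%m-%d %H:%M")


print(extract_date(html))
```

This breaks as soon as the markup changes (extra attributes, a different tag, a missing weekday), which is the usual argument for using an HTML parser instead.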

SGD