1

I'm fetching a value from a URL.

import urllib2
response = urllib2.urlopen('url')    
response.read()

It returns a very long string, so I have included only the part I'm having trouble with.

STRING TYPE OUTPUT:

'<p>Dear Customer,</p>
<p>This notice serves as proof of delivery for the shipment listed below.</p>
<dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd>
<dt><label>Shipped&#047;Billed On:</label></dt><dd>09/11/2015</dd>
<dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>
<dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt>
<dt><label>Left At:</label></dt>
<dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>'

QUESTION:

How can I extract the date (09/14/2015 11:07 A.M.) that is assigned to Delivered On?

Mazdak
Bhavesh Odedra
  • If the time format has a constant length, you might use something like re.search('Delivered On:</label></dt><dd>(.*)$',a).group(1)[:20], where a is the string. – Vineesh Sep 25 '15 at 08:04
  • @Vineesh, thank you so much for your comment. Your code works fine, but it fails when Delivered On: is empty. Here is the error: *AttributeError: 'NoneType' object has no attribute 'group'* – Bhavesh Odedra Sep 25 '15 at 13:20
  • Can you add a check for it? Something like "data = re.search('Delivered On:</label></dt><dd>(.*)$',a)" then "if data: data.group(1)[:20]". This should handle the NoneType case. – Vineesh Sep 25 '15 at 13:29
  • I added it, but it gives me this output => *'
    – Bhavesh Odedra Sep 25 '15 at 13:33
  • the code is written in the answer box – Vineesh Sep 25 '15 at 13:47

5 Answers

6

You could start by using something like Beautiful Soup or some other html parser. It might look something like this:

from bs4 import BeautifulSoup
import urllib2
response = urllib2.urlopen('url')    
html = response.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
datestr = soup.find("label", text="Delivered On:").find_parent("dt").find_next_sibling("dd").string

And if you need to, once you have a hold of the date string, you can use strptime to convert it to a datetime object.

import datetime
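# %p expects AM/PM, so normalize 'A.M.'/'P.M.' before parsing
date = datetime.datetime.strptime(datestr.replace('A.M.', 'AM').replace('P.M.', 'PM'), "%m/%d/%Y %I:%M %p")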

Remember - you generally should not find yourself parsing HTML or XML with regexes...
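
If adding a third-party dependency is not an option, roughly the same idea can be sketched with html.parser from the Python 3 standard library (the snippets above are Python 2). The class and function names below are made up for illustration, and the pairing logic assumes each value sits in the first non-empty <dd> after its <label>, as in the sample from the question.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Pair each <label> text with the text of the next non-empty <dd>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current_tag = None  # "label" or "dd" while inside one
        self._label = None        # most recent label still waiting for a value

    def handle_starttag(self, tag, attrs):
        if tag in ("label", "dd"):
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current_tag == "label":
            self._label = text
        elif self._current_tag == "dd" and self._label is not None:
            self.fields[self._label] = text
            self._label = None


def extract_fields(html):
    """Return a dict mapping label text to value text (hypothetical helper)."""
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return parser.fields


sample = ('<dl><dt><label>Delivered On:</label></dt>'
          '<dd>09/14/2015 11:07 A.M.</dd></dl>')
# .get() returns None instead of raising when the field is missing or empty
print(extract_fields(sample).get("Delivered On:"))
```

Because the lookup goes through a plain dict, an empty or absent Delivered On entry shows up as None rather than as an AttributeError.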

jfs
stett
  • "Never Say Never Again". If you want to parse 1B letters, it's better to write your own tool to parse HTML instead of using `BeautifulSoup`, because Soup is a tool for HTML analysis, and it does a lot of work that you (probably) don't need. Also, Soup is not memory efficient. – Jimilian Sep 25 '15 at 08:33
  • haha okay yes you're right... never say never. I just was thinking about this famous question (and top answer): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – stett Sep 25 '15 at 08:36
  • Now it's much better ;) Here is your +1 :) Btw, look into second answer from that topic :) – Jimilian Sep 25 '15 at 08:36
  • @Jimilian: no, regex is even less of an answer with larger masses of XML. There are fast tools to parse XML that are not BeautifulSoup. Doesn't mean regex is the only alternative. – mike3996 Sep 25 '15 at 08:39
  • In general, there are XML parsers that build up a usable DOM representation (like BS), and then there are parsers that read a stream of XML into a tokenized stream, usually used only when the input XML doesn't fit into memory. – mike3996 Sep 25 '15 at 09:01
  • @stett, thank you so much for your answer. I used the code from your answer, but it gives me this error: *AttributeError: 'NoneType' object has no attribute 'find_parent'* – Bhavesh Odedra Sep 25 '15 at 13:21
  • @Odedra: the label text was imprecise. If you don't need the exact match, you could use `text=re.compile('Delivered On')` instead. – jfs Sep 25 '15 at 16:25
  • `strptime()` also fails. You could use `.strptime(datestr.replace('A.M.', 'am').replace('P.M.', 'pm'), "%m/%d/%Y %I:%M %p")` instead. – jfs Sep 25 '15 at 16:27
1

Try this code:

import re

text = '''<p>Dear Customer,</p>
          <p>This notice serves as proof of delivery for the shipment listed below.</p>
          <dl class="outHozFixed clearfix"><label>Weight:</label></dt>
          <dd>18.00 lbs</dd>
          <dt><label>Shipped&#047;Billed On:</label></dt>
          <dd>09/11/2015</dd>
          <dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>
          <dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt>
          <dt><label>Left At:</label></dt>
          <dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>'''

re.findall(r'<dt><label>Delivered On:<\/label><\/dt><dd>([0-9\.\/\s:APM]+)', text)

OUTPUT:

['09/14/2015 11:07 A.M.']
Bhavesh Odedra
1

Based on that output only, I would use re and re.search. Create a regex for finding a date with time, like this:

import re

output = '''<p>Dear Customer,</p>
            <p>This notice serves as proof of delivery for the shipment listed below.</p>
            <dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd>
            <dt><label>Shipped&#047;Billed On:</label></dt><dd>09/11/2015</dd>
            <dt><label>Delivered On:</label></dt><dd>09/14/2015 11:07 A.M.</dd>
            <dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt>
            <dt><label>Left At:</label></dt>
            <dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>'''

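# raw string so the backslashes reach the regex engine; [AP] matches 'A' or 'P'
pattern = r'\d{2}/\d{2}/\d{4} \d{1,2}:\d{2} [AP]\.M\.'

# search the string defined above; re.search returns None when nothing matches
match = re.search(pattern, output)
result = match.group(0) if match else None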
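
The pattern above matches the first date-and-time anywhere in the string, which happens to be the delivery date in this sample. As a hedged sketch, the match can be tied to the Delivered On label and made to return None when the value is missing. The function and constant names are invented for illustration, and the markup around the label is assumed to look exactly like the sample.

```python
import re

# Assumes the value directly follows "Delivered On:</label></dt><dd>",
# with optional whitespace between the tags, as in the sample HTML.
DELIVERED_RE = re.compile(
    r'Delivered On:</label></dt>\s*<dd>\s*'
    r'(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2} [AP]\.M\.)'
)


def find_delivered_on(html):
    """Return the delivery date string, or None if it is absent or empty."""
    match = DELIVERED_RE.search(html)
    return match.group(1) if match else None
```

Callers can then test the result with `if value is not None:` instead of calling .group() on a failed match.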
makeMonday
  • Thank you so much. Your code works fine, but it fails when Delivered On: is empty. Here is the error: AttributeError: 'NoneType' object has no attribute 'group' – Bhavesh Odedra Sep 25 '15 at 13:29
1

If you don't like regexps or third-party libraries, you can always use an old-school, hardcoded one-line solution:

import datetime

text_date = [item.strip() for item in input_text.split('\n') if "Delivered On:" in item][0][41:-5]
datetime.datetime.strptime(text_date.replace(".",""), "%m/%d/%Y %I:%M %p")

For the case where the whole input is on a single line:

start_index = input_text.index("Delivered On:")+len("Delivered On:</label></dt><dd>")
stop_index = start_index + 21
text_date = input_text[start_index:stop_index]

Any solution to your question will be hardcoded in one way or another :(
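
One way to soften the hardcoded offsets is to cut on the surrounding markup instead of on fixed positions. The sketch below is only an illustration: the function name is invented, it assumes the label and value are laid out exactly as in the question's sample, and it relies on %p accepting AM/PM, which holds in English or C locales.

```python
from datetime import datetime


def delivered_on(input_text):
    """Return the delivery time as a datetime, or None if it is missing."""
    marker = "Delivered On:</label></dt><dd>"
    _, found, rest = input_text.partition(marker)
    if not found:
        return None  # label not present at all
    raw, closing, _ = rest.partition("</dd>")
    raw = raw.strip()
    if not closing or not raw:
        return None  # no closing tag, or an empty <dd></dd>
    # strptime's %p does not understand the dotted 'A.M.'/'P.M.' spelling
    normalized = raw.replace("A.M.", "AM").replace("P.M.", "PM")
    return datetime.strptime(normalized, "%m/%d/%Y %I:%M %p")
```

A value that is present but not a date in this format still raises ValueError from strptime, so callers may want to catch that separately.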

Jimilian
1

Try this code:

import re
a = """<p>Dear Customer,</p><p>This notice serves as proof of delivery for the shipment listed below.</p><dl class="outHozFixed clearfix"><label>Weight:</label></dt><dd>18.00 lbs</dd><dt><label>Shipped&#047;Billed On:</label></dt><dd>09/11/2015</dd><dt><label>Delivered On:</label></dt><dd>12/4/2015 11:07 A.M.</dd><dt><label for="">Signed By:</label></dt><dd>Odedra</dd></dt><dt><label>Left At:</label></dt><dd>Office</dd></dl><p>Thank you for giving us this opportunity to serve you.</p>"""
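data = re.search('Delivered On:</label></dt><dd>(.*)$', a)
# Guard against a missing label or an empty value before slicing
if data and data.group(1)[:1].isdigit():
    delivered_on = data.group(1)[:20]  # 20 characters covers '12/4/2015 11:07 A.M.'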
Vineesh