Using beautifulsoup to extract data that's hard to identify

Question

I have a page with the following HTML. It's obviously very poorly done, but I need to run some automation, and part of that includes getting the date below.

<tr>
     <td class="bold">
        Last Login
     </td>
     <td colspan="3" class="usual">
        4/1/2011 at 07:01:11 AM         </td>
  </tr>

Ideally I'd like to extract the contents of the second <td> and then convert it to Unix time but just grabbing it will be enough.

I was thinking this could be done with regex, but you would have to iterate over it a couple of times to pull out the contents.

score 1 · Accepted Answer · edited May 23 '17 at 12:22

If you are asking how to locate the desired element with BeautifulSoup, I would locate it by the text of the Last Login cell, which looks like a reliable anchor (though I don't know what the bigger picture is):

import re

from bs4 import BeautifulSoup

data = """
<tr>
     <td class="bold">
        Last Login
     </td>
     <td colspan="3" class="usual">
        4/1/2011 at 07:01:11 AM         </td>
</tr>
"""

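# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(data, "html.parser")
# Find the label cell by its text, then read the next <td> in the same row.
# (In bs4 versions before 4.4.0 the string= argument is spelled text=.)
label = soup.find("td", string=re.compile(r"Last Login"))
last_login = label.find_next_sibling("td").get_text(strip=True)
print(last_login)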

Which prints 4/1/2011 at 07:01:11 AM.

To get the timestamp, parse the string into a datetime object with strptime(), then convert it using one of the approaches from "Convert datetime to Unix timestamp and convert it back in python":

from datetime import datetime
import time

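# %I (12-hour clock) is required for %p (AM/PM) to affect the parsed hour;
# with %H, a PM time would be read as a morning hour.
last_login_date = datetime.strptime(last_login, "%m/%d/%Y at %I:%M:%S %p")
# mktime() interprets the naive datetime as local time.
print(time.mktime(last_login_date.timetuple()))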
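
Note that time.mktime() treats the parsed value as local time, so the result depends on the timezone of the machine running the script. If that matters, here is a minimal sketch of a timezone-explicit variant. It assumes Python 3 and assumes the page reports times in UTC (the page itself does not say, so adjust tz as needed); login_to_unix is a hypothetical helper name, not part of any library. It parses the hour with %I so that AM/PM is honored.

```python
from datetime import datetime, timezone


def login_to_unix(text, tz=timezone.utc):
    """Parse a string like '4/1/2011 at 07:01:11 AM' into a Unix timestamp.

    tz is the timezone the page's times are ASSUMED to be in (UTC by default).
    """
    # %I is the 12-hour clock, so %p (AM/PM) actually changes the hour.
    naive = datetime.strptime(text.strip(), "%m/%d/%Y at %I:%M:%S %p")
    # Attach the assumed timezone instead of relying on the local one.
    aware = naive.replace(tzinfo=tz)
    return int(aware.timestamp())


print(login_to_unix("4/1/2011 at 07:01:11 AM"))  # 1301641271
print(login_to_unix("4/1/2011 at 07:01:11 PM"))  # 1301684471
```

The two results differ by exactly 12 hours (43200 seconds), which is a quick way to check that the AM/PM marker is being applied.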
