
I need to write a regular expression that extracts the time and date from a text. I tried:

re.compile("title=\".* js-short-timestamp")

I need to get only something like:

21:14 - 2 de out de 2013
15:13 - 1 de out de 2013
14:16 - 1 de out de 2013
15:58 - 14 de set de 2013
16:06 - 13 de set de 2013
14:59 - 13 de set de 2013
12:43 - 13 de set de 2013
09:33 - 13 de set de 2013

Note: I used some re.sub calls to get only these parts. But sometimes I'm getting:

18:30 - 11 de jul de 2011 href=https://twitter.com/XXXXXXXX/status/90533484464054272 
22:10 - 3 de jul de 2011 href=https://twitter.com/XXXXXXXXX/status/87689583726313472 

Example of my text:

(Note: the first one, with data-original-title, is my problem, because I'm getting the href as well and I don't want it.)

    <a data-original-title="16:06 - 17 de jun de 2013" href="https://twitter.com/XXXXXXXX/status/346705537934712832" class="tweet-timestamp js-permalink js-nav js-tooltip"><span class="_timestamp js-short-timestamp " data-time="1371496016" data-long-form="true">17 de jun</span></a>
</small>

    <a href="https://twitter.com/XXXXXXXX/status/407906654579998720" class="tweet-timestamp js-permalink js-nav js-tooltip" title="14:18 - 3 de dez de 2013"><span class="_timestamp js-short-timestamp " data-time="1386087499" data-long-form="true">3 de dez</span></a>

2 Answers

2

You are trying to parse HTML with regular expressions; this rarely ends well.

I'd use an HTML parser instead. I recommend installing BeautifulSoup:

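from bs4 import BeautifulSoup

# html_page_source holds the page's HTML as a string
soup = BeautifulSoup(html_page_source, 'html.parser')

# <a> tags carrying the tweet-timestamp class and a data-original-title attribute
timestamps = soup.find_all('a', class_='tweet-timestamp', attrs={'data-original-title': True})
for timestamp in timestamps:
    print(timestamp['data-original-title'])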

This finds all <a> tags with (at least) the class tweet-timestamp and a data-original-title attribute, then prints that attribute.
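The sample in the question stores the text in either data-original-title or title, and the same idea can be sketched to cover both without third-party packages. This is only a minimal sketch, assuming Python 3 and swapping BeautifulSoup for the standard library's html.parser; TimestampParser and extract_timestamps are hypothetical names, and the sample string is adapted from the question.

```python
from html.parser import HTMLParser


class TimestampParser(HTMLParser):
    """Collects the tooltip text of every <a> tag with the tweet-timestamp class."""

    def __init__(self):
        super().__init__()
        self.timestamps = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if 'tweet-timestamp' not in classes:
            return
        # Assumption: the text is in data-original-title or, failing that, title.
        value = attrs.get('data-original-title') or attrs.get('title')
        if value:
            self.timestamps.append(value)


def extract_timestamps(html):
    """Return the timestamp strings found in the given HTML, in document order."""
    parser = TimestampParser()
    parser.feed(html)
    parser.close()
    return parser.timestamps


# Sample adapted from the question: one link per attribute variant.
sample = (
    '<a data-original-title="16:06 - 17 de jun de 2013" '
    'href="https://twitter.com/XXXXXXXX/status/346705537934712832" '
    'class="tweet-timestamp js-permalink js-nav js-tooltip">'
    '<span class="_timestamp js-short-timestamp " data-time="1371496016" '
    'data-long-form="true">17 de jun</span></a>\n'
    '<a href="https://twitter.com/XXXXXXXX/status/407906654579998720" '
    'class="tweet-timestamp js-permalink js-nav js-tooltip" '
    'title="14:18 - 3 de dez de 2013">'
    '<span class="_timestamp js-short-timestamp " data-time="1386087499" '
    'data-long-form="true">3 de dez</span></a>'
)

print(extract_timestamps(sample))
```

Because the parser looks at attributes by name, the order of class, href and title inside the tag does not matter, and the href never ends up in the result.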

Martijn Pieters
0

This should be a better regex to use:

time_re = re.compile(r'data-original-title="([^"]+).*js-short-timestamp')

and then you can use findall:

time_re.findall(s)  # where s is your HTML string

EDIT:

To handle both versions you need a more complex regex:

time_re = re.compile(r'data-original-title="([^"]+).*js-short-timestamp|tweet-timestamp.*title="([^"]+)"')

# Each match is a (group 1, group 2) tuple with exactly one non-empty entry
[a or b for a, b in time_re.findall(s)]  # where s is your HTML string
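As a rough check, the two-alternative pattern can be run against the two sample links from the question. This is only a sketch: the variable names s and found are placeholders, and it assumes each link sits on its own line, because `.` does not match a newline and the greedy `.*` would otherwise run on into the next link.

```python
import re

# The two-alternative pattern from this answer, split over two raw strings.
time_re = re.compile(
    r'data-original-title="([^"]+).*js-short-timestamp'
    r'|tweet-timestamp.*title="([^"]+)"'
)

# Sample adapted from the question; each <a> element is on its own line.
s = (
    '<a data-original-title="16:06 - 17 de jun de 2013" '
    'href="https://twitter.com/XXXXXXXX/status/346705537934712832" '
    'class="tweet-timestamp js-permalink js-nav js-tooltip">'
    '<span class="_timestamp js-short-timestamp " data-time="1371496016" '
    'data-long-form="true">17 de jun</span></a>\n'
    '<a href="https://twitter.com/XXXXXXXX/status/407906654579998720" '
    'class="tweet-timestamp js-permalink js-nav js-tooltip" '
    'title="14:18 - 3 de dez de 2013">'
    '<span class="_timestamp js-short-timestamp " data-time="1386087499" '
    'data-long-form="true">3 de dez</span></a>'
)

# findall returns one (group 1, group 2) tuple per match; only the group of the
# alternative that matched is non-empty, so `a or b` picks it.
found = [a or b for a, b in time_re.findall(s)]
print(found)
```

Picking the non-empty group with `a or b` behaves the same on Python 2 and Python 3, whereas indexing the result of filter() only works on Python 2.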
Pykler
  • It's very good! But I have two types of text: one with data-original-title and another with only title. Is there a way to match both together? Look at my example, where you can see what I'm talking about. – user2333163 Feb 06 '14 at 11:35
  • You are probably better off with an HTML parser, as @Martijn pointed out, but I updated my answer with a regex that works for both. The problem with the second one is that the class comes before the attribute (and Twitter might change its format at any time, so an HTML parser is the more robust choice). – Pykler Feb 06 '14 at 11:38
  • I changed your expression to re.compile(r'title="([^"]+).*js-short-timestamp') and now it all works. Thank you! – user2333163 Feb 06 '14 at 11:41
  • Yes, that will also work, since the js-short-timestamp class is attached to the inner element in the second type. – Pykler Feb 06 '14 at 11:46