I am writing a small script that reads my worked time from my email and records how much time I have already worked. It extracts the values with a regex. Here is my script:

import imaplib
import re
from pprint import pprint

mail = imaplib.IMAP4_SSL('imap.gmail.com',993)
mail.login('*************', '**************')
# Out: list of "folders" aka labels in gmail.
mail.select("inbox") # connect to inbox.

typ, data = mail.search(None, 'SUBJECT', 'Zeiterfassung')

worked_time_pattern = re.compile(r'"(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<')
# old version: worked_time_pattern = re.compile(r'\"(?P<time>[0-9]+(?:[,][0-9])?)\"(?: disabled)? selected(?: disabled)? style=3D"">[=]?[\n]?(?P=time)<\/option>')
date_pattern = re.compile('.*Date: [a-zA-Z]{1,4}[,] (?P<date>[0-9]{1,2} [a-zA-Z]{1,4} [0-9]{4}).*', re.DOTALL)
count = 0
countFail = 0
if 'OK' == typ:
    for num in data[0].split():
        typ, data = mail.fetch(num, '(RFC822)')
        mailbody = "".join(data[0][1].split("=\r\n"))
        mailbody = "".join(mailbody.split("\r"))
        mailbody = "".join(mailbody.split("\n"))
        worked_time = worked_time_pattern.search(data[0][1])
        date = date_pattern.match(data[0][1])
        if worked_time != None:
            print worked_time.group('time')
            count = count + 1
        else:
            print mailbody
            countFail = countFail + 1
        print worked_time
        print "You worked  on %s\n" % ( date.group('date'))
        #print 'Message %s\n%s\n' % (num, data[0][1])
    print count
    print countFail
mail.close()

mail.logout()

The problem is that worked_time is None for some of my messages (23 match, 8 do not), which means the pattern did not match. I tested it with several online regex testers, and they all report that the pattern matches.

Here are a few example strings that my script rejects but that online tools such as http://regex101.com accept.

I put them on Pastebin because they are big and ugly: http://pastebin.com/4Z2BdmXk http://pastebin.com/dMxcRqQu

By the way, the regex for the date works fine on all messages (but not on the pasted strings, because I had to cut away the upper part, which contained a lot of private information).

worked_time_pattern should match something like: "1,5" disabled selected style=3D"">1,5</option> (and capture the 1,5, exactly as it does in the cases that work).

Does anybody have an idea?

kave
  • I don't know if that is the problem, but you should probably use raw strings for regex, i.e. `re.compile(r'my regex here')`. Notice the "r" before the string. – rantanplan Dec 03 '13 at 23:44
  • I just tested both pastebins in the python interpreter and both matches were found. Since I was pasting it into the interpreter I only pasted in from the start of the select field (since there are no line breaks after that). – OGHaza Dec 03 '13 at 23:50
  • @rantanplan thanks, I changed that but didn't help me :/ – kave Dec 03 '13 at 23:53
  • @OGHaza as I said, it works on the online regex testers fine, and also works for half of the strings in my interpreter. but somehow not for the other half – kave Dec 03 '13 at 23:54
  • I'm not talking about an online regex tester. I'm talking about executing your code against the python interpreter on my machine - the way I run all my python code for stackoverflow questions. – OGHaza Dec 03 '13 at 23:55
  • @OGHaza that makes it even more odd for me – kave Dec 03 '13 at 23:57
  • @gumble - just for the sake of testing can you try it with the following regex `"(?P – OGHaza Dec 04 '13 at 00:23
  • @OGHaza hey thanks a lot! previously 16 worked, now 23 work. still 8 not working. I updated my code and the examples that still don't work. – kave Dec 04 '13 at 00:44
  • I don't know how `mail.fetch` (or whatever grabs the data) works but is there a chance it's putting line breaks in the data? It seems likely that the input data you're matching against is not exactly as you're expecting it to be. - Gotta go now, will check back tomorrow – OGHaza Dec 04 '13 at 01:05
  • Not sure if this helps, but I was told not to trust all the online regex testers, as there are different flavours of regexes and the Python you're using might not necessarily use the same one as the online testers. – ishikun Dec 04 '13 at 01:20
  • You wouldn't be trying to parse HTML with regex would you? – Sinkingpoint Dec 04 '13 at 01:54
  • @OGHaza hey, yeah it is putting =\r\n in the data. but I already delete them in my code...the pastes are directly copied from the console into pastebin – kave Dec 04 '13 at 10:56
  • Again for the sake of testing: `"(\r?\n?)(?P – OGHaza Dec 04 '13 at 11:07

1 Answer

If you think it is inserting =\r\n into your data then keep removing that, but also remove all \rs and \ns.

mailbody = "".join(data[0][1].split("=\r\n"))
mailbody = "".join(mailbody.split("\r"))
mailbody = "".join(mailbody.split("\n"))

Then try using the regex I suggested in the comments - although your original expression would likely work fine too.

(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<
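The `=\r\n` sequences and the `=3D` in the sample string are what quoted-printable encoding produces, so an alternative to stripping them by hand is to decode the body before matching. Below is a minimal sketch, assuming Python 3 (where `data[0][1]` from `mail.fetch` is bytes) and assuming the text being searched is quoted-printable. The helper name `extract_worked_time` and the sample bytes are made up for illustration; the pattern is copied unchanged from the question.

```python
import quopri
import re

# Pattern copied from the question; after decoding, the "=?" and "(\r?\n?)"
# parts simply match nothing.
WORKED_TIME_PATTERN = re.compile(
    r'"(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<'
)


def extract_worked_time(raw_body):
    """Return the selected time value (e.g. '1,5'), or None if nothing matches.

    raw_body is assumed to be quoted-printable encoded bytes.
    """
    # Decoding removes "=\r\n" soft line breaks and turns "=3D" back into "=".
    decoded = quopri.decodestring(raw_body).decode("utf-8", errors="replace")
    # Drop any remaining hard line breaks so the markup sits on one line.
    flattened = decoded.replace("\r", "").replace("\n", "")
    match = WORKED_TIME_PATTERN.search(flattened)
    return match.group("time") if match else None


# Hypothetical sample: the value is split by a soft line break.
sample = (
    b'<option value=3D"1,5" disabled selected style=3D"">1,=\r\n'
    b'5</option>'
)
worked_time = extract_worked_time(sample)
```

Whichever cleanup is used, the pattern has to run against the cleaned text. The script in the question builds `mailbody` but then searches `data[0][1]`, which may be worth checking.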

As Quirliom suggests in the comments, this is a perfect example of why regex shouldn't be used for parsing HTML - although if the line breaks are present mid-word then this isn't valid HTML either.
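As one possible alternative to a regex, the standard library's `html.parser` can pick out the selected option. This is only a sketch under assumptions: it uses Python 3, it expects the body to be already decoded (no `=3D` or `=\r\n` left), and it assumes the worked time sits in the `value` attribute of the `<option>` tag. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class SelectedOptionParser(HTMLParser):
    """Collects the value attribute of every <option> marked as selected."""

    def __init__(self):
        super().__init__()
        self.selected_values = []

    def handle_starttag(self, tag, attrs):
        if tag != "option":
            return
        attributes = dict(attrs)
        # A bare attribute such as "selected" is reported with value None,
        # so test for the key rather than for a truthy value.
        if "selected" in attributes:
            self.selected_values.append(attributes.get("value"))


def selected_option_values(html_text):
    """Return the value attributes of all selected <option> tags."""
    parser = SelectedOptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.selected_values


# Hypothetical, already decoded markup similar to the example in the question.
html_text = (
    '<select><option value="1">1</option>'
    '<option value="1,5" disabled selected style="">1,5</option></select>'
)
values = selected_option_values(html_text)
```

Because the parser looks at attributes rather than at their order, it should not matter whether `disabled` comes before or after `selected`.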

OGHaza
  • hey, thanks again. I edited that, but it still won't work :/ still 23 work/ 8 not. what would I use to parse HTML then? I thought stuff like this is what regex is for, but if there is an alternative I am open for it! – kave Dec 04 '13 at 19:03