Regex not working locally though working in every online regex tester

Question

I am writing a small script that reads my worked time from my email and records how much time I have already worked. It does this with a regex. Here is my script:

import imaplib
import re
from pprint import pprint

mail = imaplib.IMAP4_SSL('imap.gmail.com',993)
mail.login('*************', '**************')
# Out: list of "folders" aka labels in gmail.
mail.select("inbox") # connect to inbox.

typ, data = mail.search(None, 'SUBJECT', 'Zeiterfassung')

worked_time_pattern = re.compile(r'"(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<')
# old version: worked_time_pattern = re.compile(r'\"(?P<time>[0-9]+(?:[,][0-9])?)\"(?: disabled)? selected(?: disabled)? style=3D"">[=]?[\n]?(?P=time)<\/option>')
date_pattern = re.compile('.*Date: [a-zA-Z]{1,4}[,] (?P<date>[0-9]{1,2} [a-zA-Z]{1,4} [0-9]{4}).*', re.DOTALL)
count = 0
countFail = 0
if 'OK' == typ:
    for num in data[0].split():
        typ, data = mail.fetch(num, '(RFC822)')
        mailbody = "".join(data[0][1].split("=\r\n"))
        mailbody = "".join(mailbody.split("\r"))
        mailbody = "".join(mailbody.split("\n"))
        worked_time = worked_time_pattern.search(data[0][1])
        date = date_pattern.match(data[0][1])
        if worked_time != None:
            print worked_time.group('time')
            count = count + 1
        else:
            print mailbody
            countFail = countFail + 1
        print worked_time
        print "You worked  on %s\n" % ( date.group('date'))
        #print 'Message %s\n%s\n' % (num, data[0][1])
    print count
    print countFail
mail.close()

mail.logout()

The problem is that worked_time is None for some of my emails (more than half work: 23 match, 8 do not), which means the pattern did not match. I tested it with most online regex testers, and they all say that the pattern matches and everything is fine.

Here are a few example strings that were not matched locally but are matched by online tools, e.g. http://regex101.com

I put them on pastebin because they are big and ugly: http://pastebin.com/4Z2BdmXk http://pastebin.com/dMxcRqQu

By the way, the date regex works fine on all of the emails (but not on the pasted strings, because I had to cut away the upper part, which contained a lot of private information).

worked_time_pattern should find something like: "1,5" disabled selected style=3D"">1,5</option> (and extract the 1,5 from it, exactly as it already does in the cases that work...)

Anybody any idea?

I don't know if that is the problem, but you should probably use raw strings for regex, i.e. `re.compile(r'my regex here')`. Notice the "r" before the string. — rantanplan, Dec 03 '13 at 23:44
I just tested both pastebins in the python interpreter and both matches were found. Since I was pasting it into the interpreter I only pasted in from the start of the select field (since there are no line breaks after that). — OGHaza, Dec 03 '13 at 23:50
@OGHaza as I said, it works on the online regex testers fine, and also works for half of the strings in my interpreter. but somehow not for the other half — kave, Dec 03 '13 at 23:54
I'm not talking about an online regex tester. I'm talking about executing your code against the python interpreter on my machine - the way I run all my python code for stackoverflow questions. — OGHaza, Dec 03 '13 at 23:55
@gumble - just for the sake of testing can you try it with the following regex `"(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<` - it's not that this regex is better, it's just more lenient. I expect it still won't work - but at that point I'm going to have to give up — OGHaza, Dec 04 '13 at 00:23
@OGHaza hey thanks a lot! previously 16 worked, now 23 work. still 8 not working. I updated my code and the examples that still don't work. — kave, Dec 04 '13 at 00:44
I don't know how `mail.fetch` (or whatever grabs the data) works but is there a chance it's putting line breaks in the data? It seems likely that the input data you're matching against is not exactly as you're expecting it to be. - Gotta go now, will check back tomorrow — OGHaza, Dec 04 '13 at 01:05
Not sure if this helps, but I was told not to trust all the online regex testers, as there are different flavours of regex and the one in the Python you're using might not necessarily be the same as the one the online testers use. — ishikun, Dec 04 '13 at 01:20
@OGHaza hey, yeah it is putting =\r\n in the data. but I already delete them in my code...the pastes are directly copied from the console into pastebin — kave, Dec 04 '13 at 10:56
Again for the sake of testing: `"(\r?\n?)(?P<time>(\d(\r?\n?))+((\r?\n?),(\r?\n?)\d)?)(\r?\n?)"[^>]*?s(\r?\n?)e(\r?\n?)l(\r?\n?)e(\r?\n?)c(\r?\n?)t(\r?\n?)e(\r?\n?)d` If that matches then it is 100% the case that your input data contains line breaks that aren't present in the paste bin. The solution to which is either properly cleanse your data or as Quirliom is suggesting: use an HTML parser. — OGHaza, Dec 04 '13 at 11:07

score -1 · Answer 1 · answered Dec 04 '13 at 11:17

If you think it is inserting =\r\n into your data then keep removing that, but also remove all \rs and \ns.

# Chain each step on mailbody so the previous removal is not thrown away.
mailbody = "".join(data[0][1].split("=\r\n"))
mailbody = "".join(mailbody.split("\r"))
mailbody = "".join(mailbody.split("\n"))

Then try using the regex I suggested in the comments - although your original expression would likely work fine too.

"(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<
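A minimal sketch of why the cleansing step matters, using a made-up sample string (not taken from the pastebins) in which a quoted-printable soft line break (`=\r\n`) falls in the middle of `selected`. It only helps if the search runs on the cleaned `mailbody`; the script in the question passes `data[0][1]` to `search`, which might be why the counts did not change after the cleanup was added.

```python
import re

worked_time_pattern = re.compile(
    r'"(?P<time>\d+(,\d)?)"[^>]*?selected[^>]*>=?(\r?\n?)(?P=time)<')

# Hypothetical raw body: the soft line break "=\r\n" splits the word "selected".
raw = '<option value=3D"1,5" disabled sele=\r\ncted style=3D"">1,5</option>'

# Remove soft line breaks first, then any remaining bare CR / LF.
mailbody = raw.replace("=\r\n", "").replace("\r", "").replace("\n", "")

raw_match = worked_time_pattern.search(raw)         # searches the untouched text
clean_match = worked_time_pattern.search(mailbody)  # searches the cleaned text

print(raw_match)
print(clean_match.group('time'))
```

Here the raw text does not contain the literal word `selected`, so only the cleaned string can match.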

As Quirliom suggests in the comments, this is a perfect example of why regex shouldn't be used for parsing HTML - although if the line breaks are present mid-word then this isn't valid HTML either.
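A rough sketch of the parser route, assuming Python 3 and only the standard library (`quopri` to undo the quoted-printable encoding, `html.parser` to walk the markup; in Python 2 the parser lives in the `HTMLParser` module). The class name `SelectedOptionParser` and the sample body are made up for illustration, and a real message would first need its HTML part extracted, for example with the `email` module.

```python
import quopri
from html.parser import HTMLParser


class SelectedOptionParser(HTMLParser):
    """Collects the text of every <option> that has a 'selected' attribute."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_selected = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; bare attributes have value None.
        if tag == "option":
            self.in_selected = any(name == "selected" for name, _ in attrs)

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_selected = False

    def handle_data(self, data):
        if self.in_selected and data.strip():
            self.values.append(data.strip())


# Hypothetical quoted-printable body with a soft line break inside "selected".
raw = (b'<select><option value=3D"1">1</option>'
       b'<option value=3D"1,5" disabled sele=\r\ncted style=3D"">1,5</option>'
       b'</select>')

# quopri.decodestring undoes the "=3D" escapes and removes soft line breaks.
html = quopri.decodestring(raw).decode("utf-8")

parser = SelectedOptionParser()
parser.feed(html)
parser.close()
print(parser.values)
```

Decoding before parsing means the attribute order (`disabled selected` versus `selected disabled`) and the position of line breaks no longer matter.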

hey, thanks again. I edited that, but it still won't work :/ still 23 work/ 8 not. what would I use to parse HTML then? I thought stuff like this is what regex is for, but if there is an alternative I am open for it! — kave, Dec 04 '13 at 19:03
