0

I'm trying to return only the first match from the variable below:

MACHINE: p1prog06<br>

MACHINE: p1prog06

using the following expression:

res = list(set([re.sub(r'=(?:\^M)?|[\r\n]+', '', m.group(1)) for m in re.finditer(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email_body, re.M)]))

According to the documentation,

`list(set(res))`

should return unique values, but I'm getting

[u'p1prog06', u'p1prog06<br><br>']

Code:

import email
import imaplib
import re

conn = imaplib.IMAP4_SSL("outlook.office365.com")
conn.login(user,pwd)
conn.select("test")

resp, items = conn.uid("search" ,None, '(OR (FROM "email@pexample.com") (FROM "email2@pexample.com"))')



items = items[0].split()
for emailid in items:
    resp, data = conn.uid("fetch",emailid, "(RFC822)")
    if resp == 'OK':
        email_body = data[0][1].decode('utf-8')
        mail = email.message_from_string(email_body)
        #get all emails with words "PA1" or "PA2" in subject
        if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
                  #search email body for job name (string after word "JOB")
          regex1 = r'(?<!^)JOB:\s*(\S+)'
          #regex2 = r'\bMACHINE:\s*(.*(?:\s*^\d+)?)'
          #c=re.searchall(regex2, email_body, re.M)#,re.DOTALL)
          a=re.findall(regex1 ,email_body)
          #res = [re.sub(r'=(?:\^M)?|[\r\n]+', '', m.group(1)) for m in re.finditer(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email_body, re.M)]
          res = list(set([re.sub(r'=(?:\^M\<br><br>)?|[\r\n]+', '', m.group(1)) for m in re.finditer(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email_body, re.M)]))
  • 5
    Those two values are obviously not the same, and therefore they *are* unique. Note: *"unique"* does not mean *"only one"*. It just means "*no repetitions*". – user2390182 Jan 11 '19 at 14:09
  • 1
    Possible duplicate of [Regular expression to stop at first match](https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match) – tbhaxor Jan 11 '19 at 14:28
  • Take a look at https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match – tbhaxor Jan 11 '19 at 14:28
  • 1
    You might also remove all `<br>` tags, `re.sub(r'=(?:\^M)?|<br>|[\r\n]+', '', m.group(1))`, and `list(set())` will do its job. FYI: The `re.M` is no longer necessary, there are no anchors in the pattern.
    – Wiktor Stribiżew Jan 11 '19 at 14:38
  • Thanks again, Wiktor, you saved me again, it works! –  Jan 11 '19 at 15:08

3 Answers

0

As the comment points out, your two values are not identical, so set is behaving correctly. Either add an alternative to the re.sub pattern that removes the <br> tags (the set will then drop the duplicate entry), or, if you only want the first match in email_body, use re.search, which returns only the first match, instead of re.finditer.
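A minimal sketch of the second option, assuming a sample body modelled on the two lines in the question. The `\w+` capture is an assumption about what a machine name looks like; it is not taken from the original expression.

```python
import re

# Hypothetical sample body, modelled on the two lines shown in the question.
email_body = "MACHINE: p1prog06<br><br>\n\nMACHINE: p1prog06\n"

# re.search returns only the first match (or None if there is none).
# \w+ is an assumption: machine names made of letters, digits, underscores.
match = re.search(r'\bMACHINE:\s*(\w+)', email_body)
first_machine = match.group(1) if match else None
print(first_machine)
```

Because `\w+` stops at the `<` character, the `<br>` markup never ends up in the captured value.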

nick
  • Why not include some code to elaborate your point? Also, `<br>` is not a tag here, it is just part of the string, and `set` is not a command, it is a class which converts the values into hashes; when two values generate the same hash value, that is how we get the unique values, and that is the reason they are unordered.
    – mad_ Jan 11 '19 at 14:23
  • Thanks for the answer, I tried `re.sub(r'=(?:\^M\<br><br>)?|[\r\n]+` but nothing changed
    –  Jan 11 '19 at 14:24
  • I believe what you have suggested above tries to match the `<br><br>` directly after the M. The following should work: `re.sub(r'(M.*:)|(\s)+|(<br>)', ...)`. But I prefer @Predicate's answer so up-voting that.
    – nick Jan 11 '19 at 15:11
0

If you want you can improve your regex to this:

\bMACHINE:\s*([^<]*(?:(?:\r\n?|\n)\S+)?)

Now your regex will stop at the < sign.
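A small sketch of how this pattern behaves, assuming a short sample body modelled on the question. One thing to watch: unlike `.`, the class `[^<]` also matches newlines, so on a longer body the capture may run on until the next `<`; the sample below is deliberately short.

```python
import re

# Hypothetical sample body, modelled on the two lines shown in the question.
email_body = "MACHINE: p1prog06<br>\nMACHINE: p1prog06\n"

pattern = r'\bMACHINE:\s*([^<]*(?:(?:\r\n?|\n)\S+)?)'

# Same cleanup as in the question: drop '=' soft breaks and line endings.
res = sorted(set(
    re.sub(r'=(?:\^M)?|[\r\n]+', '', m.group(1))
    for m in re.finditer(pattern, email_body)
))
print(res)
```

The first capture ends right before `<br>`, and the second loses its trailing newline in the `re.sub` step, so both collapse to the same value in the set.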

Superluminal
0

Your main regex used in re.finditer matches <br> tags. All you need is to remove them with the re.sub:

re.sub(r'=(?:\^M)?|<br\s*(?:/\s*)?>|[\r\n]+', '', m.group(1))
                   ^^^^^^^^^^^^^^^^ 

You may also use it with re.findall like this:

res = list(set([re.sub(r'=(?:\^M)?|<br\s*(?:/\s*)?>|[\r\n]+', '', m) for m in re.findall(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email_body)]))

Note re.M is redundant and is removed.

The <br\s*(?:/\s*)?> pattern matches <br, then \s* matches zero or more whitespace characters, (?:/\s*)? matches an optional sequence of / and zero or more whitespace characters, and > finally matches >. So, it can match <br/>, <br>, <br /> and even <br / >.
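To check the explanation above, here is a short sketch that tests the tag pattern on the four variants and then runs the full expression on a sample body. The sample text is an assumption modelled on the question, not the real email.

```python
import re

# The <br> sub-pattern on its own, checked against the variants listed above.
br_tag = re.compile(r'<br\s*(?:/\s*)?>')
variants = ['<br>', '<br/>', '<br />', '<br / >']
all_match = all(br_tag.fullmatch(v) for v in variants)

# Hypothetical sample body, modelled on the two lines shown in the question.
email_body = "MACHINE: p1prog06<br><br>\n\nMACHINE: p1prog06\n"

cleanup = r'=(?:\^M)?|<br\s*(?:/\s*)?>|[\r\n]+'
res = list(set(
    re.sub(cleanup, '', m)
    for m in re.findall(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email_body)
))
print(all_match, res)
```

With the tags stripped before the values reach `set`, the two captures become identical and only one entry remains.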

Wiktor Stribiżew