I'm new to Python, so apologies if this is trivial. Some of my emails contain the following lines in the email body:

Event demon log entry:

[27/12/2018 08:15:02] CAUAJM_I_40245 EVENT: ALARM ALARM: MAXRUNALARM JOB: p1_credit_qv_curve_snap MACHINE: p1prog06

With this code:

#!/usr/bin/python

import email, imaplib, re
user = 'user@example.com'
pwd = 'pass'

conn = imaplib.IMAP4_SSL("outlook.office365.com")
conn.login(user, pwd)
conn.select("Inbox")

resp, items = conn.uid("search", None, 'ALL')
items = items[0].split()
for emailid in items:
    resp, data = conn.uid("fetch", emailid, "(RFC822)")
    if resp == 'OK':
        email_body = data[0][1].decode('utf-8')
        mail = email.message_from_string(email_body)
        # "in" also matches a subject that starts with PA1/PA2 (find() > 0 skips index 0)
        subject = mail["Subject"] or ""
        if "PA1" in subject or "PA2" in subject:
            match = re.findall(r'Event demon log entry.*\n.*\n.*', email_body, re.IGNORECASE)
            print(match)

I'm getting:

[u'Event demon log entry:\r\n\r\n[27/12/2018 08:15:02] CAUAJM_I_40245 EVENT: ALARM ALARM: MAXRUNALARM JOB: p=\r', u'Event demon log entry:<br><br=\r\n>[27/12/2018 08:15:02]      CAUAJM_I_40245 EVENT: ALARM            ALARM: M=\r\nAXRUNALARM      JOB: p1_credit_qv_curve_snap MACHINE: p1prog06<br><br>Attac=\r']

How can I get rid of the HTML in the output?

I need the following output (on one line, if possible):

Event demon log entry:[27/12/2018 08:15:02] CAUAJM_I_40245 EVENT: ALARM ALARM: MAXRUNALARM JOB: p1_credit_qv_curve_snap MACHINE: p1prog06

1 Answer

You might use 2 capturing groups:

(\bEvent demon log entry:)(?:\r?\n|\r)+(\[[^]]+\].*)

See the regex demo | Python demo

That will match:

  • (\bEvent demon log entry:) Capture the literal text in the first group, starting at a word boundary
  • (?:\r?\n|\r)+ Match a newline 1+ times (or use {2} instead of + to match exactly 2 times)
  • (\[[^]]+\].*) Capture in the second group: match [, then 1+ characters that are not ] using a negated character class, then the closing ]. Then match 0+ times any character except a newline

For example using findall:

import re
regex = r"(\bEvent demon log entry:)(?:\r?\n|\r)+(\[[^]]+\].*)"
email_body = ("Event demon log entry:\n\n"
            "[27/12/2018 08:15:02] CAUAJM_I_40245 EVENT: ALARM ALARM: MAXRUNALARM JOB: p1_credit_qv_curve_snap MACHINE: p1prog06")

for (g1, g2) in re.findall(regex, email_body , re.IGNORECASE):
    print(g1 + g2)
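
The `=\r\n` sequences in the question's output look like quoted-printable soft line breaks, and the second list element appears to come from the HTML alternative of the message. One way to sidestep both before applying the regex is to let the `email` package decode only the `text/plain` parts. This is a minimal Python 3 sketch under that assumption; the helper names `plain_text_parts` and `extract_entries` and the sample message are made up for illustration:

```python
import email
import re

PATTERN = re.compile(
    r"(\bEvent demon log entry:)(?:\r?\n|\r)+(\[[^]]+\].*)", re.IGNORECASE
)


def plain_text_parts(raw_bytes):
    """Yield the decoded text of every text/plain part of a raw message."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the Content-Transfer-Encoding
            # (quoted-printable or base64) and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            yield payload.decode(charset, errors="replace")


def extract_entries(raw_bytes):
    """Return each log entry as a single line (hypothetical helper)."""
    entries = []
    for text in plain_text_parts(raw_bytes):
        for g1, g2 in PATTERN.findall(text):
            # "." also matches "\r", so strip the trailing carriage return.
            entries.append(g1 + g2.strip())
    return entries


# Made-up sample message modelled on the question's data.
SAMPLE = (
    b"Subject: PA1 alarm\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Event demon log entry:\r\n"
    b"\r\n"
    b"[27/12/2018 08:15:02] CAUAJM_I_40245 EVENT: ALARM ALARM: MAXRUNALARM JOB: p=\r\n"
    b"1_credit_qv_curve_snap MACHINE: p1prog06\r\n"
)

for entry in extract_entries(SAMPLE):
    print(entry)
```

With the question's IMAP loop, the raw bytes in `data[0][1]` could be passed to `extract_entries` directly, without calling `.decode('utf-8')` first.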
The fourth bird
  • Sorry, how do I put this into re.findall? –  Dec 27 '18 at 09:56
  • @xerks I have added a demo that uses either findall or search. [findall](https://docs.python.org/2/library/re.html?highlight=findall#re.findall) returns a list of tuples. For your example data, you get a list with 1 tuple. To get the first and second group from the first tuple in the list, you could for example use `[0][0]` and `[0][1]` – The fourth bird Dec 27 '18 at 09:57
  • @xerks If you have multiple matches you could also loop through the result https://ideone.com/5laoo8 – The fourth bird Dec 27 '18 at 10:04
  • I have issues with this: `email_body = data[0][1].decode('utf-8') mail = email.message_from_string(email_body) if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0: pattern = r"(\bEvent demon log entry:)(?:\r?\n|\r)+(\[[^]]+\].*)" re.findall(pattern,emal_body)` I got nothing as a result –  Dec 27 '18 at 10:11
  • @xerks I think you don't see a result because you don't use print. I see you have updated your question to get rid of the HTML? Try it with a print statement https://ideone.com/Ow3gDV – The fourth bird Dec 27 '18 at 10:23
  • @xerks There are newlines in the rest of the data which you might also replace with an empty string. See this example https://ideone.com/kTAYy4 If you want to remove the HTML as well, you could take a look at [this page](https://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) – The fourth bird Dec 27 '18 at 10:50
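
The cleanup suggested in the last comment can be sketched with the standard library alone. The function name `clean_fragment` is made up, and the sample string is a shortened copy of the HTML element from the question's output. The tag-stripping regex is deliberately crude and is only assumed to be adequate for simple markup such as `<br>`; the linked page covers more robust approaches.

```python
import quopri
import re
from html import unescape


def clean_fragment(fragment):
    """Turn a quoted-printable HTML fragment into one plain-text line."""
    # 1. Undo quoted-printable encoding; this removes "=\r\n" soft line breaks.
    decoded = quopri.decodestring(fragment.encode("utf-8")).decode(
        "utf-8", errors="replace"
    )
    # 2. Drop anything that looks like an HTML tag (crude, see the note above).
    no_tags = re.sub(r"<[^>]+>", "", decoded)
    # 3. Resolve entities such as &amp; and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()


fragment = (
    "Event demon log entry:<br><br=\r\n>[27/12/2018 08:15:02]      "
    "CAUAJM_I_40245 EVENT: ALARM            ALARM: M=\r\n"
    "AXRUNALARM      JOB: p1_credit_qv_curve_snap MACHINE: p1prog06<br>"
)

print(clean_fragment(fragment))
```

The soft line breaks are removed first because one of them splits a `<br>` tag in the sample, which would otherwise hide that tag from the regex in step 2.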