Python Regex to Extract Email Information

Question

I have the data below. From it, I want to retrieve only the message body and remove everything related to the “Forward” header.

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 03/21/2000 
01:24 PM ---------------------------

Stephane Brodeur
03/16/2000 07:06 AM
To: Phillip K Allen/HOU/ECT@ECT
cc:  
Subject: Maps

As requested by John, here's the map and the forecast...
Call me if you have any questions (403) 974-6756.

What I have tried so far is the regular expression below:

matchObjj = re.search(r'(---.*?)Subject:', tmp_text, re.DOTALL)

When I print the remainder of the text using the command below

print( tmp_text[matchObjj.span()[1]:])

I get the following output.

Maps
 
As requested by John, here's the map and the forecast...
Call me if you have any questions (403) 974-6756.

So the issue is that the regex does not strip the complete “Subject:” line. Only the header label Subject: is removed, while the actual subject text (“Maps” in this case) is still there. I want the regex to match through the end of the Subject line so that the whole line is removed. Please share your thoughts.

score 0 · Answer 1 · answered Oct 03 '21 at 18:47

The simplest way should be to change your regex to this:

r'(---.*?)Subject:[^\n]*\n'

This will make your match extend all the way to the next newline, making the end of its span the start of the next line.
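As a minimal sketch of how the adjusted pattern behaves on the sample from the question (the names pattern, match and body are only illustrative, and the lstrip() call is an added assumption to drop the blank line left between the header and the body):

```python
import re

tmp_text = (
    "---------------------- Forwarded by Phillip K Allen/HOU/ECT on 03/21/2000 \n"
    "01:24 PM ---------------------------\n"
    "\n"
    "Stephane Brodeur\n"
    "03/16/2000 07:06 AM\n"
    "To: Phillip K Allen/HOU/ECT@ECT\n"
    "cc:  \n"
    "Subject: Maps\n"
    "\n"
    "As requested by John, here's the map and the forecast...\n"
    "Call me if you have any questions (403) 974-6756.\n"
)

# [^\n]*\n consumes the rest of the Subject line, including its newline.
pattern = re.compile(r'(---.*?)Subject:[^\n]*\n', re.DOTALL)
match = pattern.search(tmp_text)

# Fall back to the full text if no forwarded header is found.
# lstrip() (an assumption, not part of the answer) removes the blank
# separator line that follows the Subject line.
body = tmp_text[match.end():].lstrip() if match else tmp_text
print(body)
```

Using match.end() is equivalent to matchObjj.span()[1] in the question.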

rkechols


Thanks rk, it really worked and got the output. – Asad Kamal Oct 03 '21 at 19:09
@AsadKamal then please accept as the answer – rkechols Oct 03 '21 at 19:17

score 0 · Answer 2 · answered Oct 03 '21 at 18:52

You can do this without regex by splitting the text into a list of lines with splitlines and slicing that list after the Subject line:

text = '''---------------------- Forwarded by Phillip K Allen/HOU/ECT on 03/21/2000 
01:24 PM ---------------------------
 
 
Stephane Brodeur
03/16/2000 07:06 AM
To: Phillip K Allen/HOU/ECT@ECT
cc:  
Subject: Maps
 
As requested by John, here's the map and the forecast...
Call me if you have any questions'''

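data = text.splitlines()
# Index of the first line that starts with the Subject header
slice_idx = next(i for i, s in enumerate(data) if s.startswith('Subject: '))
# Skip the Subject line and the blank line after it, then rejoin with newlines
body = '\n'.join(data[slice_idx+2:])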

output:

As requested by John, here's the map and the forecast...
Call me if you have any questions

RJ, the reason I am using regex is that one email thread can contain multiple forward blocks, and slicing becomes difficult when there are several forward blocks in the same thread. – Asad Kamal Oct 03 '21 at 19:11
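For the multiple-block case raised in this comment, one possible approach is re.sub, which removes every match instead of only the first. This is only a sketch: the two-block thread below is made-up sample data, the names header and cleaned are illustrative, and it assumes that message bodies never contain a run of three or more dashes.

```python
import re

# Hypothetical thread containing two forwarded blocks (made-up sample data).
thread = (
    "---------------------- Forwarded by A on 03/21/2000 \n"
    "01:24 PM ---------------------------\n"
    "\n"
    "Sender One\n"
    "To: A\n"
    "Subject: Maps\n"
    "\n"
    "First body line.\n"
    "\n"
    "---------------------- Forwarded by B on 03/22/2000 \n"
    "09:00 AM ---------------------------\n"
    "\n"
    "Sender Two\n"
    "To: B\n"
    "Subject: Forecast\n"
    "\n"
    "Second body line.\n"
)

# Match from a run of dashes through the end of the next Subject line,
# plus any whitespace that follows it. The lazy .*? stops at the first
# "Subject:" after each dash run, so each header is matched separately.
header = re.compile(r'-{3,}.*?Subject:[^\n]*\n\s*', re.DOTALL)

# sub() replaces every non-overlapping match, so all headers are removed.
cleaned = header.sub('', thread)
print(cleaned)
```

If the bodies can contain dashed lines of their own, the pattern would need a stricter anchor, for example requiring the word Forwarded right after the dashes.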

score -1 · Answer 3 · answered Oct 03 '21 at 18:42

There may be extra whitespace after the subject line, or perhaps a \t separator in your case. You can try matching one or more whitespace characters after it, e.g.

regexEquation = r"(---.*?)Subject:[^\n]*(\s)+"

You can get help for matching more spaces from here or here.

output:

As requested by John, here's the map and the forecast...
Call me if you have any questions (403) 974-6756.

edited Oct 03 '21 at 19:01

answered Oct 03 '21 at 18:42

Owais Ch


UPD: This will not give you the required result; the answer by @rkechols will work fine. – Owais Ch Oct 03 '21 at 18:59
