REGEX - (using Python 3.5) - finding strings in file

Question

I'm opening an Outlook .msg file and need to extract some specific data from it. I'm still fairly new to regex and am having trouble matching what I need.

Below is the data from the file. It seems to contain some tabs, just FYI:

NEWS ID:    918273/1
TITLE:  News Platform Solution Overview (CNN) (US English Session)
ACCOUNT:    supernewsplatformacct (55712)

Your request has been completed.

Output Format   MP4

Please click on the "Download File" link below to access the download page.

Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>

I need:

918273 -from- NEWS ID: 918273/1

News Platform Solution Overview (CNN) (US English Session) -from- TITLE: News Platform Solution Overview (CNN) (US English Session)

supernewsplatformacct -from- ACCOUNT: supernewsplatformacct (55712)

http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4 -from- Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>

I'm trying

[\n\r][ \t]*NEWS ID:[ \t]*([^\n\r]*)

But with no luck. Any help would be greatly appreciated!

Possible duplicate of [Learning Regular Expressions](http://stackoverflow.com/questions/4736/learning-regular-expressions) — Biffen, Dec 09 '16 at 20:13
Use `\s` (whitespace) instead of combinations of space, tab, and `\r / \n`. Just to make things cleaner. Why does your regex start with `[\n\r]`? And can you show us some python code? — Alex Hall, Dec 09 '16 at 20:17
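On the leading `[\n\r]` raised above: that character class has to consume an actual line-break character, so the pattern cannot match a `NEWS ID:` line that happens to be the very first line of the text. The following is a minimal sketch, not the asker's actual code; the variable `sample` and the use of tabs as separators are assumptions made for illustration. It compares the original pattern with a `^` anchor under `re.MULTILINE`.

```python
import re

# Assumed stand-in for the message body; tabs separate labels from values.
sample = (
    "NEWS ID:\t918273/1\n"
    "TITLE:\tNews Platform Solution Overview (CNN) (US English Session)\n"
)

original = r"[\n\r][ \t]*NEWS ID:[ \t]*([^\n\r]*)"
anchored = r"^[ \t]*NEWS ID:[ \t]*([^\n\r]*)"

# The original pattern needs a line break before "NEWS ID:",
# so it finds nothing when that label is on the first line.
first_line = re.search(original, sample)

# With a newline in front, the same pattern matches.
after_newline = re.search(original, "\n" + sample)

# "^" with re.MULTILINE matches at the start of the string
# and right after every newline, covering both cases.
with_anchor = re.search(anchored, sample, re.MULTILINE)

print(first_line)
print(after_newline.group(1))
print(with_anchor.group(1))
```

If the real file has other content before `NEWS ID:`, the original pattern may well match there, so this only explains the failure for text shaped like the sample.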

vks · Accepted Answer · 2016-12-09T20:38:50.213

(?:^|(?<=\n))[^:<\n]*[:<](.*)

You can use this with `re.findall`. See the demo:

https://regex101.com/r/d7RPNB/2
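A minimal sketch of running this pattern through `re.findall` in Python, assuming the message text is already in a string. The name `msg`, the tab separators, and the clean-up step are assumptions for illustration and are not part of the answer itself. The captured group keeps leading whitespace, and on the download line it keeps the trailing `>`, so the sketch strips both afterwards.

```python
import re

# Assumed stand-in for the message text, with tabs after the labels.
msg = (
    "NEWS ID:\t918273/1\n"
    "TITLE:\tNews Platform Solution Overview (CNN) (US English Session)\n"
    "ACCOUNT:\tsupernewsplatformacct (55712)\n"
    "\n"
    "Your request has been completed.\n"
    "\n"
    "Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>"
)

pattern = r"(?:^|(?<=\n))[^:<\n]*[:<](.*)"

# With a single capture group, findall returns a list of that group's text.
raw = re.findall(pattern, msg)

# Drop surrounding whitespace and the ">" left over from the URL line.
cleaned = [value.strip().rstrip(">") for value in raw]
print(cleaned)
```

Lines without a `:` or `<` (such as "Your request has been completed.") produce no match, so only the labelled lines and the download link come back.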

While this is close, the last item he wanted was `http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4`, and you only get `//news......` – depperm Dec 09 '16 at 20:25
I see the demo, and what I'm saying is that it doesn't match his requirements exactly. Here is a slightly modified version which gets the `http://....`; I think something with named groups, where he has to check whether it's a URL or not, would work: `(?:^|(?<=\n))[^:\n]*[^http]:\s*(?P.*)|(?Phttp:.*)>` – depperm Dec 09 '16 at 20:36
I modified yours so it doesn't capture the extra space or the `>` at the end; check https://regex101.com/r/WJkRK5/1 – depperm Dec 09 '16 at 20:43

freegnu · Answer 2 · 2016-12-09T21:08:03.187

msg = """NEWS ID:    918273/1
TITLE:  News Platform Solution Overview (CNN) (US English Session)
ACCOUNT:    supernewsplatformacct (55712)

Your request has been completed.

Output Format   MP4

Please click on the "Download File" link below to access the download page.

Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>"""
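import re

# First alternative captures the text after "label:"; the second captures the URL inside <...>.
regex = r'[^:]+:\s+(.*)$|[^<]+<([^>]+)>'
matches = []
for line in msg.split('\n'):
    m = re.match(regex, line)  # match once per line instead of three times
    if m:
        matches.append(m.group(1) or m.group(2))
print(matches)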
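The list above still carries the `/1` after the news ID and the `(55712)` after the account name. A minimal sketch of one way to get exactly the four values asked for in the question, using one pattern per field with `re.MULTILINE`; the dictionary keys, the per-field patterns, and the tab separators in `msg` are assumptions for illustration rather than part of this answer.

```python
import re

# Assumed stand-in for the message text, with tabs after the labels.
msg = (
    "NEWS ID:\t918273/1\n"
    "TITLE:\tNews Platform Solution Overview (CNN) (US English Session)\n"
    "ACCOUNT:\tsupernewsplatformacct (55712)\n"
    "\n"
    "Your request has been completed.\n"
    "\n"
    "Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>"
)

# One pattern per requested field; each group captures only the wanted part.
patterns = {
    "news_id": r"^NEWS ID:[ \t]*(\d+)",              # digits before the "/"
    "title": r"^TITLE:[ \t]*(.+?)[ \t]*$",           # rest of the line, trimmed
    "account": r"^ACCOUNT:[ \t]*(\S+)",              # first word after the label
    "url": r"^Download File[ \t]*<([^>]+)>",         # text between < and >
}

fields = {}
for name, pattern in patterns.items():
    m = re.search(pattern, msg, re.MULTILINE)
    fields[name] = m.group(1) if m else None  # None when a label is missing

print(fields)
```

Separate patterns are longer than a single alternation, but each one can be adjusted on its own if a label or its layout turns out to differ in the real file.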
