I'm opening an Outlook .msg file and need to extract some specific data from it. I'm still fairly new to regex and am having trouble matching what I need.

Below is the data from the file. Note that it seems to contain some tabs:

NEWS ID:    918273/1
TITLE:  News Platform Solution Overview (CNN) (US English Session)
ACCOUNT:    supernewsplatformacct (55712)

Your request has been completed.

Output Format   MP4

Please click on the "Download File" link below to access the download page.

Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>

I need:

918273 -from- NEWS ID: 918273/1

News Platform Solution Overview (CNN) (US English Session) -from- TITLE: News Platform Solution Overview (CNN) (US English Session)

supernewsplatformacct -from- ACCOUNT: supernewsplatformacct (55712)

http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4 -from- Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>

I'm trying

[\n\r][ \t]*NEWS ID:[ \t]*([^\n\r]*)

But with no luck. Any help would be greatly appreciated!

Kenny
  • Possible duplicate of [Learning Regular Expressions](http://stackoverflow.com/questions/4736/learning-regular-expressions) – Biffen Dec 09 '16 at 20:13
  • Use `\s` (whitespace) instead of combinations of space, tab, and `\r` / `\n`, just to make things cleaner. Why does your regex start with `[\n\r]`? And can you show us some Python code? – Alex Hall Dec 09 '16 at 20:17

2 Answers

2
(?:^|(?<=\n))[^:<\n]*[:<](.*)

You can use this with `re.findall`. See the demo:

https://regex101.com/r/d7RPNB/2
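
As a rough sketch of how the pattern above might be used from Python: the `msg` string here is a hand-typed copy of the sample from the question, and the assumption that the gaps after the labels are single tabs is mine. The `cleaned` step is also an addition, since the capture group keeps the leading whitespace and the trailing `>` of the link.

```python
import re

# Sample text copied from the question; tabs after the labels are an assumption.
msg = (
    "NEWS ID:\t918273/1\n"
    "TITLE:\tNews Platform Solution Overview (CNN) (US English Session)\n"
    "ACCOUNT:\tsupernewsplatformacct (55712)\n"
    "\n"
    "Your request has been completed.\n"
    "\n"
    "Output Format\tMP4\n"
    "\n"
    'Please click on the "Download File" link below to access the download page.\n'
    "\n"
    "Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>"
)

pattern = r'(?:^|(?<=\n))[^:<\n]*[:<](.*)'

# With a single capture group, findall returns a list of the captured strings.
raw = re.findall(pattern, msg)

# Drop the surrounding whitespace and the closing '>' of the link.
cleaned = [value.strip().rstrip('>') for value in raw]
print(cleaned)
```

Note that the first and third values still carry the `/1` and `(55712)` suffixes, so they would need a further split or a more specific pattern to match the question exactly.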

vks
  • while this is close the last item he wanted was `http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4` and you only get `//news......` – depperm Dec 09 '16 at 20:25
  • I see the demo and that's what I'm saying is it doesn't match his requirements exactly, here is a slightly modified version which gets the `http://....`, I think something with named groups where he has to check if it's a url or not would work `(?:^|(?<=\n))[^:\n]*[^http]:\s*(?P.*)|(?Phttp:.*)>` – depperm Dec 09 '16 at 20:36
  • I modified yours to not get extra spaces or `>` at the end, check https://regex101.com/r/WJkRK5/1 – depperm Dec 09 '16 at 20:43
0
msg = """NEWS ID:    918273/1
TITLE:  News Platform Solution Overview (CNN) (US English Session)
ACCOUNT:    supernewsplatformacct (55712)

Your request has been completed.

Output Format   MP4

Please click on the "Download File" link below to access the download page.

Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>"""
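import re

# First alternative captures the text after "LABEL:"; the second captures the URL inside <...>.
regex = re.compile(r'[^:]+:\s+(.*)$|[^<]+<([^>]+)>')
matches = []
for line in msg.splitlines():
    m = regex.match(line)  # match once per line instead of three times
    if m:
        # Only one of the two groups is set for any given match.
        matches.append(m.group(1) or m.group(2))
print(matches)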
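
The generic pattern above returns `918273/1` and `supernewsplatformacct (55712)`, which is slightly more than the question asks for. A minimal sketch of one alternative, assuming the labels always appear at the start of a line exactly as in the sample, is to use one small pattern per field. The dictionary keys (`news_id`, `title`, `account`, `url`) are names made up for this illustration.

```python
import re

# Sample text copied from the question; tabs after the labels are an assumption.
msg = (
    "NEWS ID:\t918273/1\n"
    "TITLE:\tNews Platform Solution Overview (CNN) (US English Session)\n"
    "ACCOUNT:\tsupernewsplatformacct (55712)\n"
    "\n"
    "Your request has been completed.\n"
    "\n"
    "Output Format\tMP4\n"
    "\n"
    'Please click on the "Download File" link below to access the download page.\n'
    "\n"
    "Download File <http://news.downloadwebsitefake.com/newsid/file1294757493292848575.mp4>"
)

# One pattern per field; each has exactly one capture group.
patterns = {
    "news_id": r"^NEWS ID:\s*(\d+)",           # digits only, stops before "/1"
    "title": r"^TITLE:\s*(.+?)\s*$",           # rest of the line, trimmed
    "account": r"^ACCOUNT:\s*(\S+)",           # first word, stops before " (55712)"
    "url": r"^Download File\s*<([^>]+)>",      # text between the angle brackets
}

fields = {}
for name, pattern in patterns.items():
    # re.MULTILINE lets ^ and $ match at every line boundary, not just the string ends.
    m = re.search(pattern, msg, flags=re.MULTILINE)
    fields[name] = m.group(1) if m else None

print(fields)
```

Because `^` is used with `re.MULTILINE`, the `NEWS ID` pattern also matches on the very first line, where a pattern that begins with `[\n\r]` cannot, since there is no preceding newline to consume.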
freegnu