
The following regex:

(?:X-)?Received: (?:by|from) ([^ \n]+)

will, for the following lines, match the text in bold:

Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03

Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;

Received: from localhost (localhost [127.0.0.1])

If I alter the text such that "Received: by " and "Received: from " are removed from each line, I am left with:

from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03

by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;

from localhost (localhost [127.0.0.1])

How do I then update the regex to match just the IP addresses or domains (e.g. mail2.oknotify2.com, 10.66.156.198) in this text?

I can reduce it to (?:by|from) ([^ \n]+), and that gives me "from mail2.oknotify2.com", "by 10.66.156.198" etc. But how do I take the last step and omit the "by " and "from ", leaving only the domain or IP address? Like the original, the final regex should also ignore any subsequent domains or IPs on a line, e.g. mx.google.com in the first line.

Wiktor Stribiżew
Pyderman
  • I am not sure what you want: 2 regexps to match domains and IPs? [This regex](https://regex101.com/r/yC6rI0/1) is not what you are looking for? – Wiktor Stribiżew Jun 04 '15 at 18:12
  • @stribizhev Thanks for the suggested regex. It includes the trailing "by " and "from " in the matches though. What I'm looking for is to omit these, leaving just the domain or IP. – Pyderman Jun 04 '15 at 18:23
  • What regex flavor are you using? In [.NET, you can use variable width look-behind](http://regexstorm.net/tester?p=(%3f%3c%3d%5e(%3f%3aby%7cfrom)%5cs*)%5cS%2b&i=from+mail2.oknotify2.com+(mail2.oknotify2.com.+%5b208.83.243.70%5d)+by+mx.google.com+with+ESMTP+id+dp5si2596299pdb.170.2015.06.03.14.12.03%0d%0a%0d%0aby+10.66.156.198+with+SMTP+id+wg6mr62843415pab.126.1433365924352%3b%0d%0a%0d%0afrom+localhost+(localhost+%5b127.0.0.1%5d)&o=m), in [PCRE, there is `\K`](https://regex101.com/r/yC6rI0/2). – Wiktor Stribiżew Jun 04 '15 at 18:32
  • @stribizhev odd, when I take your regex over to nregex.com and try it there, the only match returned is "from mail2.oknotify2.com". Yet when I run the generated Python code on regex101, I get exactly what I'm looking for: just the domains and IPs. The re.MULTILINE seems to be a factor, but what else explains the difference in behaviour? P.S. Thanks for the solution, and for making me aware of regex101.com – Pyderman Jun 04 '15 at 18:45
  • So, Python? I will post my answer for Python then. – Wiktor Stribiżew Jun 04 '15 at 18:51

3 Answers


You can use \K to reset the start of the reported match, so that everything matched before it is left out of the result:

(?:X-)?Received: (?:by|from) \K([\S]+)

See Demo

EDIT:

As @James Newton said, \K is not supported by all regex flavors. You can refer to this post to see whether your engine supports it:

https://stackoverflow.com/a/13543042/3393095

EDIT 2:

Since you specified Python, just using the capturing groups and re.findall on your regex will do, like this:

>>> import re
>>> text = ("Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n"
... "Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n"
... "Received: from localhost (localhost [127.0.0.1])")
>>> re.findall(r'(?:X-)?Received: (?:by|from) ([\S]+)', text)
['mail2.oknotify2.com', '10.66.156.198', 'localhost']
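
To make the role of the capturing group concrete, here is a minimal sketch (not part of the original answer) run against two of the already stripped lines from the question. It relies on the documented behaviour of re.findall: with no capturing group it returns whole matches, with exactly one group it returns only that group's text. The variable names are illustrative.

```python
import re

# Two of the stripped header lines from the question.
text = (
    "by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n"
    "from localhost (localhost [127.0.0.1])"
)

# No capturing group: findall returns each whole match, prefix included.
whole = re.findall(r'(?:by|from) \S+', text)

# One capturing group: findall returns only what the group captured.
hosts = re.findall(r'(?:by|from) (\S+)', text)

print(whole)  # ['by 10.66.156.198', 'from localhost']
print(hosts)  # ['10.66.156.198', 'localhost']
```

The "by " and "from " prefixes are still consumed by the match; they are just not part of what findall hands back.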
Rodrigo López
  • Caution: If this is for JavaScript then the \K will simply match uppercase K. – James Newton Jun 04 '15 at 18:41
  • @JamesNewton Thanks. I'm only looking to match the domain name and IP themselves though, no leading characters. So if I pare back your regex thus: (?:by|from) \K([\S]+) this almost does the job, but it still matches the additional domain (mx.google.com) in line 1. How can we prevent this? – Pyderman Jun 04 '15 at 18:55

I'm writing an answer because a comment does not allow for formatting, but the correct answer is given by @stribizhev.

@stribizhev proposed this regex:

^(?:by|from) (\S+)

The ?: at the beginning of (?:by|from) makes it a non-capturing group, while (\S+) is a capturing group. In JavaScript, if you use result = string.match(regex) (without the g flag) and there is a match, then result will contain an array such as ["from mail2.oknotify2.com", "mail2.oknotify2.com"]. The first element is the whole match, and result[1] is the captured group.
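
Since the question turned out to be about Python, a rough equivalent of the whole-match versus captured-group distinction might look like the sketch below. It assumes the header text has already been split into lines; first_host is a hypothetical helper name, not something from the thread. Because re.match only tries at the start of the string, no explicit ^ is needed and a later "by mx.google.com" on the same line is never considered.

```python
import re

# The three stripped lines from the question, one string per line.
lines = [
    "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) "
    "by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03",
    "by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;",
    "from localhost (localhost [127.0.0.1])",
]

pattern = re.compile(r'(?:by|from) (\S+)')


def first_host(line):
    """Hypothetical helper: host after a leading 'by '/'from ', else None."""
    m = pattern.match(line)  # match() only tries at the start of the string
    return m.group(1) if m else None


first = pattern.match(lines[0])
whole_match = first.group(0)  # 'from mail2.oknotify2.com'
captured = first.group(1)     # 'mail2.oknotify2.com'

hosts = [first_host(line) for line in lines]
print(hosts)  # ['mail2.oknotify2.com', '10.66.156.198', 'localhost']
```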

James Newton

You can use the re.MULTILINE flag so that ^ matches at the start of every line rather than only at the start of the string. To extract just the text you need, you will have to use a capturing group.

Unfortunately, Python's native re library supports neither \K nor variable-width look-behind. A variable-width look-behind can, however, be used with the external regex library.

Here is some sample code that you can use:

import re

# ^ with re.MULTILINE anchors the match to the start of each line
p = re.compile(r'^(?:by|from) (\S+)', re.MULTILINE)
test_str = "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n\nby 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n\nfrom localhost (localhost [127.0.0.1])"
# group(1) holds the host that follows the leading "by " or "from "
print([x.group(1) for x in p.finditer(test_str)])

Output of a demo program:

['mail2.oknotify2.com', '10.66.156.198', 'localhost']
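
As an aside on the look-behind remark above: a single look-behind such as (?<=^(?:by|from) ) is rejected by the native re library because its alternatives differ in length, but each alternative on its own has a fixed width. The sketch below is one possible workaround under that assumption, not the approach this answer recommends. It alternates two fixed-width look-behinds so that the whole match is already the bare host, with no capturing group involved.

```python
import re

text = (
    "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) "
    "by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n\n"
    "by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n\n"
    "from localhost (localhost [127.0.0.1])"
)

# Each look-behind has a fixed width (5 and 3 characters), which re accepts.
# ^ with re.MULTILINE ties the prefix to the start of a line, so the later
# "by mx.google.com" on the first line is skipped.
pattern = re.compile(r'(?:(?<=^from )|(?<=^by ))\S+', re.MULTILINE)

hosts = pattern.findall(text)
print(hosts)  # ['mail2.oknotify2.com', '10.66.156.198', 'localhost']
```

The capturing-group version is usually easier to read; this form is mainly useful when an API only exposes the whole match.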
Wiktor Stribiżew