
Suppose I want a regex that finds domain names in text, including the subdomain if there is one. For example, it should match:

abc.xyz. 
google.
yahoo.
mail.google.

Snippet:

import re

pattern = r'((\s*\w+.\s*)+)'
matches = re.findall(pattern, line)
for m in matches:
    ...

The inner parentheses give me m[0], which I don't need; I only need m[1]. What can I substitute for the inner parentheses so that I get my result in m[0]?

PS: Having extra match groups () is confusing, and I want to avoid them unless I need those particular values.

David
  • You could use a non-capturing group, `((?:\s*\w+.\s*)+)`; this way the inner group is not captured – Augusto Hack Nov 03 '13 at 23:17
  • possible duplicate of [Python urlparse -- extract domain name without subdomain](http://stackoverflow.com/questions/14406300/python-urlparse-extract-domain-name-without-subdomain) – Ben Nov 03 '13 at 23:17
  • @Ben, this is not a duplicate. I am giving an example but asking a broader syntactic question. hack.augusto has a point above – David Nov 03 '13 at 23:19
  • @hack.augusto so what will be m[0] in this case? – David Nov 03 '13 at 23:20
  • @hack.augusto is there any other way to group things without using ()? – David Nov 03 '13 at 23:21
  • @David, running `python` is the fastest way to see the results. In this case `re.findall('((?:\s*\w+.\s*)+)', 'abc.xyz.')` gives `['abc.xyz.']`. Grouping can only be done [with parentheses](http://docs.python.org/2/library/re#regular-expression-syntax) – Augusto Hack Nov 03 '13 at 23:27
  • @hack.augusto, I wish you had posted that as an answer rather than as a comment. I am accepting Barmar's response, which is quite similar to yours but came 10 minutes after your comment. – David Nov 03 '13 at 23:33
  • @David, you probably do not want to use a plain `.` in your regex, since it matches any character. Try this: `re.findall('((?:\w+[.])+)+', 'abc.xyz. \ngoogle.\nyahoo.\nmail.google.')` – Augusto Hack Nov 03 '13 at 23:34
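
The last comment's point about a plain `.` can be checked quickly. This is a minimal sketch; the sample string `'abc!'` is made up for illustration, and the second pattern is a simplified form of the one in the comment (no capturing groups).

```python
import re

# An unescaped "." matches any character (except a newline),
# so "abc!" is accepted even though it contains no dot.
loose = re.findall(r'\w+.', 'abc!')

# "[.]" (or "\.") matches only a literal dot, so nothing is found.
strict = re.findall(r'\w+[.]', 'abc!')

# The sample text from the comment, matched with a literal dot
# and a non-capturing group so findall returns plain strings.
text = 'abc.xyz. \ngoogle.\nyahoo.\nmail.google.'
domains = re.findall(r'(?:\w+[.])+', text)

print(loose)    # ['abc!']
print(strict)   # []
print(domains)  # ['abc.xyz.', 'google.', 'yahoo.', 'mail.google.']
```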

1 Answer


You can make a group non-capturing by putting ?: right after its opening parenthesis:

((?:\s*\w+.\s*)+)

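BTW, in a match object (as returned by re.search() or re.finditer()) the outer parentheses are group 1 and the inner parentheses are group 2 -- numbering works by counting left parentheses, starting from 1 -- and group 0 refers to the entire match. re.findall() with more than one group instead returns tuples containing only the groups, so there the outer group is m[0] and the inner group is m[1]. In your case the entire match is the same as the outer group, because you have the entire thing in a group (why?).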
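
As a rough illustration of what changes in the `findall()` result, here is a small sketch. It uses a simplified pattern (`\w+\.` with an escaped dot) rather than the exact one from the question, and the sample text is just one of the question's examples.

```python
import re

text = "mail.google."

# Two capturing groups: findall returns one tuple per match, (outer, inner).
# A repeated group keeps only its last repetition, hence 'google.'.
capturing = re.findall(r'((\w+\.)+)', text)

# Inner group made non-capturing: one group left, so plain strings come back.
non_capturing = re.findall(r'((?:\w+\.)+)', text)

# No capturing group at all: findall returns the whole match as a string.
no_groups = re.findall(r'(?:\w+\.)+', text)

print(capturing)      # [('mail.google.', 'google.')]
print(non_capturing)  # ['mail.google.']
print(no_groups)      # ['mail.google.']
```

The last form shows that the outer parentheses are optional when only the whole match is wanted.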

Barmar
  • Thanks Barmar. Say we have abcd and the pattern is '(ab)|(cd)': what will m[0] and m[1] be here? Won't m[0] be ab and m[1] be cd? – David Nov 03 '13 at 23:24
  • In the first match, `m[0]` will be `ab` and `m[1]` will be empty. In the second match, `m[1]` will be `cd` and `m[0]` will be empty. – Barmar Nov 03 '13 at 23:26
  • are you using findall() to match? – David Nov 03 '13 at 23:27
  • Your question uses `findall()`, so that's what I'm referring to. – Barmar Nov 03 '13 at 23:28
  • Just a side question about performance: which is faster if I have two regexes that are mutually orthogonal? Combining them as r = (regex1)|(regex2) and calling matches = re.findall(r, line), or storing regex1 and regex2 in a list and doing two separate searches, as described in http://stackoverflow.com/questions/14100868/python-multiple-regular-expressions – David Nov 04 '13 at 05:19
  • Intuitively I'd expect the combined regexp to be faster, since it only has to be compiled once and you only have to call the regexp matcher once. Note that you can rewrite your regexp to just `re1|re2`, then you don't have to worry about whether the match is in group 1 or 2, it will be in 0 for all matches. – Barmar Nov 04 '13 at 13:50
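
The `(ab)|(cd)` example from these comments can be run directly. The sketch below compares what `re.findall()` returns with what match objects from `re.finditer()` report, and also shows the ungrouped `ab|cd` form; the variable names are only illustrative.

```python
import re

text = "abcd"

# findall with two groups: each result is a tuple of the groups only.
# A group that did not take part in the match shows up as ''.
grouped = re.findall(r'(ab)|(cd)', text)

# finditer yields match objects: group(0) is the whole match, and a
# group that did not take part in the match is None.
by_match = [(m.group(0), m.group(1), m.group(2))
            for m in re.finditer(r'(ab)|(cd)', text)]

# Without capturing groups, findall returns the matched strings themselves.
ungrouped = re.findall(r'ab|cd', text)

print(grouped)    # [('ab', ''), ('', 'cd')]
print(by_match)   # [('ab', 'ab', None), ('cd', None, 'cd')]
print(ungrouped)  # ['ab', 'cd']
```

So index 0 means "first group" in a `findall()` tuple but "entire match" on a match object, which is easy to mix up.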