
This is the text file abc.txt:

abc.txt

aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in

I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.

parser.py

import re,sys,os,subprocess
path = r"C:\abc.txt"  # raw string, so "\a" is not read as an escape sequence
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))

Why is the output

Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2

Can someone please help me understand why it matches only the last character of the second group? I think it's matching 0 from s0, 1 from s1, and 2 from s2.

But why?

Dhiwakar Ravikumar
  • Why are you using `re.search` instead of `re.match`? – Jimilian Feb 23 '15 at 17:47
  • Regex is overkill for what you're trying to accomplish. Just split the line on the colon, and you will get the elements as a list (`line.split(':')`) – Darrick Herwehe Feb 23 '15 at 17:54
  • "Overkill"? Does that mean it's a pretty complicated way of achieving something simple? :) Or will it be slower than `line.split(':')`? Thanks, I'll use `line.split`, but I'm also learning regex, which is why I asked the question :) – Dhiwakar Ravikumar Feb 23 '15 at 17:58

2 Answers


Let's show a simplified example:

>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'

If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.

If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
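
Applied to one line from the question, the difference might look like the sketch below. The third group is simplified to `(.*)` purely for illustration, and the variable names `m_last` and `m_full` are made up for this example.

```python
import re

line = "bb:s1:defence.gov.in"

# Quantifier outside the group: every repetition overwrites the capture,
# so only the last matched character survives.
m_last = re.search(r'(\w)*:(\w)*:(.*)', line)

# Quantifier inside the group: the whole run of characters is captured.
m_full = re.search(r'(\w*):(\w*):(.*)', line)

print(m_last.group(1), m_last.group(2))  # last character of each field
print(m_full.group(0))                   # the entire match
print(m_full.group(1), m_full.group(2), m_full.group(3))
```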

That said, as the comments suggest, regexes are overkill for this.

>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
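
For the whole file, a split-based version could look roughly like this. It is only a sketch: the list `lines` stands in for the open file object, and `parse_line` is a hypothetical helper name, not something from the question.

```python
# Stand-in for iterating over the open file; each line keeps its newline.
lines = [
    "aa:s0:education.gov.in\n",
    "bb:s1:defence.gov.in\n",
    "cc:s2:finance.gov.in\n",
]

def parse_line(line):
    """Strip the trailing newline and split into exactly three fields."""
    # maxsplit=2 keeps any further colons inside the last field.
    name, code, site = line.strip().split(':', 2)
    return name, code, site

# Skip blank lines so the unpacking above never sees an empty string.
records = [parse_line(l) for l in lines if l.strip()]

for name, code, site in records:
    print(code, site)
```

Passing a maxsplit of 2 is a design choice here: it guarantees three fields even if the site part ever contains a colon.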
user2357112

Also, group(0) is the entire match by default:

If a groupN argument is zero, the corresponding return value is the entire matching string.

So you should skip it and check group(3) if you want the last one.

Also, you should compile the regexp before the for loop. It can improve the performance of your parser.

You can also replace (\w)* with (\w*) if you want to match all the characters between the colons.
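
Putting those suggestions together, a minimal sketch might look like this. The names `SITE_RE`, `sample_lines`, and `codes` are illustrative, the list stands in for the file, and the last group is simplified to `(\S+)` as an assumption about what the site field contains.

```python
import re

# Compile once, outside the loop. Quantifiers sit inside the groups,
# so each group captures the whole field rather than its last character.
SITE_RE = re.compile(r'(\w*):(\w*):(\S+)')

# Stand-in for the lines read from the file, including a blank line.
sample_lines = [
    "aa:s0:education.gov.in\n",
    "\n",
    "bb:s1:defence.gov.in\n",
    "cc:s2:finance.gov.in\n",
]

codes = []
for line in sample_lines:
    m = SITE_RE.match(line)
    if m is None:
        # Blank or malformed line: skip it instead of failing on .group().
        continue
    codes.append(m.group(2))
    print('group(2) =', m.group(2), '| group(3) =', m.group(3))
```

Note that the re module caches recently compiled patterns, so the speed gain from compiling up front may be small; the main benefit is often readability.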

Jimilian
  • While there may be benefits to pre-compiling, [performance improvement is questionable](http://stackoverflow.com/a/452143/2348587). – Darrick Herwehe Feb 23 '15 at 17:56
  • @interjay, this answer was based on my own conclusions. 1) The OP asks what's wrong with the brackets []. Only the last group has brackets, so I decided that the OP wants to print the last group. 2) The OP is not using group(0), but I decided that the OP wants to print the last group. However, he is using group(2) for this purpose, which is wrong, because group(0) is a "bonus". – Jimilian Feb 23 '15 at 18:11