Regular expression doesn't extract the whole ID from a log file

Question

I have the following input in a log file and I want to capture the complete IDs. However, my regular expression returns only part of each ID instead of the whole ID:

id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ 
id:A2uhasan30hamwix160212145302428 
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ 
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ 
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ 
id:A2uhasan30hamwix160207145023750

I am using the following regular expression with Python 2.7. I changed sid to id, that is, from

RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9._+]*))', re.U)

to

>>> RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)
>>> sid = RE_SID.search('id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤').group('sid')
>>> sid
'A2uhasan30hamwix'

So the result contains only the Latin part of the ID:

A2uhasan30hamwix

Edit: this is how I read the log file:

with open(cfg.log_file) as input_file: ...
     fields = line.strip().split(' ')

and this is an example line from the log:

2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok provider:sipovvvv1.yv.vs

I would appreciate help with fixing my regular expression.

Try [`id:(<<")?(?P<sid>[A-Za-z\d._+]*)`](https://regex101.com/r/wW6mE2/1). Note you do not have `sid:` in your input. — Wiktor Stribiżew, Mar 29 '16 at 16:54
@WiktorStribiżew I have changed it to `RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)`. The output of my Python script is still the same, without any change. — pm1359, Mar 30 '16 at 09:20
What version of Python is it? How do you obtain the input string? Please post all relevant details in the question. You also need to use `u` alongside the `r` prefix. — Wiktor Stribiżew, Mar 30 '16 at 09:26
@Wiktor Stribiżew It is exactly as I have shown above: only the Latin part is returned. The Python version is 2.7. I have added `ur` as the prefix and nothing changes! I will update the question with more details. — pm1359, Mar 30 '16 at 10:03
*I have following input in the log file* - did you decode it from UTF8 after `read()`ing it? — Wiktor Stribiżew, Mar 30 '16 at 10:05
Please post the code showing how you read the file in. It should be something like `import codecs // f = codecs.open('myFile.txt', encoding='utf-8') // for line in f:`, or after reading it, use `.decode('utf-8')` on it. — Wiktor Stribiżew, Mar 30 '16 at 10:11
I edited your question with my suggestion, please check. I get `A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤`, `A2uhasan30hamwix160212145302428`, `A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١`, `A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢`, `A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧`, `A2uhasan30hamwix160207145023750` — Wiktor Stribiżew, Mar 30 '16 at 10:32
@Wiktor Stribiżew, I have used `with open`, and it has no encoding parameter. As I am new to Python, I would like to be sure that this change doesn't have a side effect on the rest of the code. I mean, I also write output with `csvwriter = csv.writer(csv_test,quoting=csv.QUOTE_MINIMAL)`. Do I have to change the rest of the code as well? — pm1359, Mar 30 '16 at 10:51
Writing is something different, please do not make the question *too broad*. — Wiktor Stribiżew, Mar 30 '16 at 10:52
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107724/discussion-between-maryam-pashmi-and-wiktor-stribizew). — pm1359, Mar 30 '16 at 11:05

score 1 · Answer 1 · edited May 23 '17 at 12:24

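Three things to fix:

- use id instead of sid
- use \d instead of 0-9 to also catch the Arabic-Indic digits (in Python 2 this relies on the re.U flag and on the input being a unicode string)
- there is no need for an extra capturing group inside the sid named group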

Fixed version:

id:(<<")?(?P<sid>[A-Za-z\d_.+]+)
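Why `\d` only helps on decoded text can be shown with a minimal sketch. It is written for Python 3 (an assumption; the question uses Python 2.7), where `bytes` roughly plays the role of Python 2's `str` and `str` the role of `unicode`. The names `TEXT_RE` and `BYTES_RE` are illustrative; the sample ID comes from the question.

```python
import re

# The pattern from this answer, once for decoded text and once for raw bytes.
TEXT_RE = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)')
BYTES_RE = re.compile(rb'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)')

line = 'id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤'

# On str, \d matches any Unicode decimal digit, including Arabic-Indic ones.
text_sid = TEXT_RE.search(line).group('sid')

# On UTF-8 encoded bytes, \d matches only ASCII 0-9, so the match stops early.
bytes_sid = BYTES_RE.search(line.encode('utf-8')).group('sid')

print(text_sid == line[3:])  # True: the whole ID after "id:" is captured
print(bytes_sid)             # b'A2uhasan30hamwix'
```

The truncated second result mirrors what the question reports: the pattern can be correct and still stop at the first non-ASCII digit if the input was never decoded.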

answered Mar 29 '16 at 17:03

alecxe

I have changed it to what you wrote there; the result is the same, without the Arabic numbers: `RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)', re.U)` – pm1359 Mar 30 '16 at 09:22

score 1 · Accepted Answer · answered Mar 30 '16 at 11:20

Based on what we discussed in the chat, posting the solution:

import codecs
import re

# \d also matches non-ASCII digits: the pattern is unicode and re.U is set
RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)

# Read the file with UTF8 encoding so that every line is a unicode object
input_file = codecs.open(cfg.log_file, encoding='utf-8')
for line in input_file:
    fields = line.strip().split(u' ')  # u prefix is important!
    if len(fields) >= 11:
        try:
            # ......
            sid = RE_SID.search(fields[7]).group('sid')  # Or check if there is a match first
        except AttributeError:  # assumed handler: search() returned None
            continue
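For readers on Python 3, the same idea can be sketched with the built-in `open()`, which accepts an `encoding` argument there. This is an illustration under assumptions, not the code from the chat: the helper name `extract_sids` is hypothetical, the temporary file stands in for `cfg.log_file`, and the sample line is the one from the question.

```python
import os
import re
import tempfile

RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d._+]*)')


def extract_sids(path):
    """Return the id of every log line that has at least 11 fields."""
    sids = []
    # The encoding argument makes every line a decoded str.
    with open(path, encoding='utf-8') as input_file:
        for line in input_file:
            fields = line.strip().split(' ')
            if len(fields) < 11:
                continue
            match = RE_SID.search(fields[7])
            if match:  # check for a match before calling group()
                sids.append(match.group('sid'))
    return sids


# Sample line from the question, written to a temporary file for the demo.
SAMPLE = ('2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO '
          'consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> '
          'id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok '
          'provider:sipovvvv1.yv.vs\n')

with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.log',
                                 delete=False) as tmp:
    tmp.write(SAMPLE)

sids = extract_sids(tmp.name)
os.remove(tmp.name)
print(sids)  # one ID, including the Arabic-Indic digits
```

Checking the match object before calling `group()` avoids the `AttributeError` that a line without an id would otherwise raise.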

Wiktor Stribiżew

I would like to add that `with codecs.open(cfg.log_file, encoding='utf-8') as input_file:` (that is, `codecs.open` used with `with`) works perfectly too. Interestingly, in this case I don't even need the `u` in `....split(u' ')`; it also works with `fields = line.strip().split(' ')`. @Wiktor Stribiżew – pm1359 Mar 30 '16 at 13:40

score 0 · Answer 3 · edited Mar 29 '16 at 17:08

string = '''
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ 
id:A2uhasan30hamwix160212145302428 
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ 
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ 
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ 
id:A2uhasan30hamwix160207145023750
'''
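import re
reObj = re.compile(r'id:.*')
# findall() on a compiled pattern takes (string, pos, endpos), not flags;
# passing re.DOTALL there meant pos=16, which skipped the first id
ans = reObj.findall(string)

print(ans)

Output:

['id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ ', 
 'id:A2uhasan30hamwix160212145302428 ', 
 'id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ ', 
 'id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ ', 
 'id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ ', 
 'id:A2uhasan30hamwix160207145023750']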
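
One detail of this approach is easy to get wrong: a compiled pattern's `findall(string, pos, endpos)` takes a start position as its second argument, not flags. The sketch below uses made-up sample text to show what happens when `re.DOTALL`, whose integer value is 16, lands in that slot. Flags belong in `re.compile()`.

```python
import re

text = 'id:first\nid:second\nid:third'
pattern = re.compile(r'id:.*')

# Intended call: only the string is passed.
all_ids = pattern.findall(text)

# Pitfall: the second positional argument is `pos`, so re.DOTALL
# (integer value 16) makes the search start at index 16.
skipped = pattern.findall(text, re.DOTALL)

print(all_ids)  # ['id:first', 'id:second', 'id:third']
print(skipped)  # ['id:third']
```

Any id that begins before index 16 is silently dropped, which is how a first entry can go missing from the result list.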
