
I have the following input in a log file, and I want to capture each ID in full. However, my regular expression returns only part of each ID instead of the whole thing:

id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ 
id:A2uhasan30hamwix160212145302428 
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ 
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ 
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ 
id:A2uhasan30hamwix160207145023750

I am using the following regular expression with Python 2.7.

I changed `sid` to `id`, from:
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9._+]*))', re.U)

to

>>> RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)
>>> sid = RE_SID.search('id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤').group('sid')
>>> sid
'A2uhasan30hamwix'

and this is my result:

is: A2uhasan30hamwix

After edit: This is how I am reading the log file:

with open(cfg.log_file) as input_file: ...
     fields = line.strip().split(' ')

and an example line from the log:

2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok provider:sipovvvv1.yv.vs

I would appreciate help fixing my regular expression.

pm1359
  • Do you want your regex to also capture the Arabic numerals? – tzaman Mar 29 '16 at 16:53
  • Try [`id:(<<")?(?P<sid>[A-Za-z\d._+]*)`](https://regex101.com/r/wW6mE2/1). Note you do not have `sid:` in your input. – Wiktor Stribiżew Mar 29 '16 at 16:54
  • @WiktorStribiżew I have changed it to `RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)`. The output is still the same, without any change. – pm1359 Mar 30 '16 at 09:20
  • What version of Python is it? How do you obtain the input string? Please post all relevant details in the question. You also need to use `u` alongside the `r` prefix. – Wiktor Stribiżew Mar 30 '16 at 09:26
  • @Wiktor Stribiżew It is exactly as I have shown above: I get just the Latin-alphabet part. The Python version is 2.7. I have added the `ur` prefix, but nothing changes. I will update the question with more details. – pm1359 Mar 30 '16 at 10:03
  • *I have following input in the log file* - did you encode it in UTF8 after `read()`ing it? – Wiktor Stribiżew Mar 30 '16 at 10:05
  • @Wiktor Stribiżew no, I didn't encode it, how? – pm1359 Mar 30 '16 at 10:09
  • Please post the code showing how you read the file in. It should be something like `import codecs // f = codecs.open('myFile.txt', encoding='utf-8') // for line in f:`, or after reading it, use `.encode('utf-8')` on it. – Wiktor Stribiżew Mar 30 '16 at 10:11
  • I edited your question with my suggestion, please check. I get `A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤`, `A2uhasan30hamwix160212145302428`, `A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١`, `A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢`, `A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧`, `A2uhasan30hamwix160207145023750` – Wiktor Stribiżew Mar 30 '16 at 10:32
  • Is that code working for you? – Wiktor Stribiżew Mar 30 '16 at 10:44
  • @Wiktor Stribiżew, I have used `with open`, and it has no encoding parameter. As I am new to Python, I would like to be sure that this change doesn't have a side effect on the rest of the code; I also write output with `csvwriter = csv.writer(csv_test,quoting=csv.QUOTE_MINIMAL)`. Do I have to change the rest as well? – pm1359 Mar 30 '16 at 10:51
  • Writing is something different, please do not make the question *too broad*. – Wiktor Stribiżew Mar 30 '16 at 10:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107724/discussion-between-maryam-pashmi-and-wiktor-stribizew). – pm1359 Mar 30 '16 at 11:05

3 Answers


3 things to fix:

Fixed version:

id:(<<")?(?P<sid>[A-Za-z\d_.+]+)
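For illustration, here is a minimal sketch of how this pattern behaves on decoded text versus raw UTF-8 bytes. It is written for Python 3, which is an assumption (the question uses Python 2.7); the names `RE_SID_BYTES` and `SAMPLE` are made up for the demo. In Python 3, a `str` pattern is Unicode-aware, so `\d` also matches Arabic-Indic digits, while a `bytes` pattern only matches ASCII digits and stops at the first non-ASCII byte, which resembles the truncated result in the question.

```python
import re

# str pattern: Unicode-aware by default in Python 3, so \d covers ١٦٠... too.
RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)')
# bytes pattern (hypothetical name): \d only matches the ASCII digits 0-9.
RE_SID_BYTES = re.compile(rb'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)')

# One of the ids from the question.
SAMPLE = 'id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤'

# Matching decoded text captures the whole id.
text_sid = RE_SID.search(SAMPLE).group('sid')
# Matching the undecoded UTF-8 bytes stops before the Arabic-Indic digits.
bytes_sid = RE_SID_BYTES.search(SAMPLE.encode('utf-8')).group('sid')

print(text_sid)
print(bytes_sid)
```

So the pattern itself may be fine; what matters is whether the input has been decoded to text before matching.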
alecxe
  • I have changed it to what you wrote; the result is the same, still without the Arabic numerals: `RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d_.+]+)', re.U)` – pm1359 Mar 30 '16 at 09:22

Based on what we discussed in the chat, posting the solution:

import codecs
import re
RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U) # \d used to match non-ASCII digits, too
input_file = codecs.open(cfg.log_file, encoding='utf-8')  # Read the file with UTF8 encoding
for line in input_file: 
    fields = line.strip().split(u' ') # u prefix is important!
    if len(fields) >= 11:
        try:
            # ......
            sid = RE_SID.search(fields[7]).group('sid') # Or check if there is a match first
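For readers on Python 3 (an assumption; the code above targets Python 2.7), a rough equivalent could look like the sketch below. The built-in `open` accepts `encoding='utf-8'` there, so `codecs` is not needed. The helper `extract_sids`, the temporary demo file, and the `LOG_LINE` constant (copied from the example line in the question) are hypothetical names used only for this demonstration, and the sketch checks for a match instead of relying on `try`.

```python
import os
import re
import tempfile

# In Python 3, str patterns are Unicode-aware, so \d matches non-ASCII digits too.
RE_SID = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d._+]*)')

def extract_sids(path):
    """Collect the sid from each line with at least 11 space-separated fields."""
    sids = []
    # Decode the file as UTF-8 while reading, so each line is text, not bytes.
    with open(path, encoding='utf-8') as input_file:
        for line in input_file:
            fields = line.strip().split(' ')
            if len(fields) < 11:
                continue
            match = RE_SID.search(fields[7])
            if match:  # skip lines whose 8th field has no id
                sids.append(match.group('sid'))
    return sids

# Demo input: the example log line from the question, written to a temp file.
LOG_LINE = ('2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO '
            'consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> '
            'id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok '
            'provider:sipovvvv1.yv.vs')

with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.log',
                                 delete=False) as tmp:
    tmp.write(LOG_LINE + '\n')
    demo_path = tmp.name

demo_sids = extract_sids(demo_path)
os.remove(demo_path)
print(demo_sids)
```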
Wiktor Stribiżew
  • I would like to add that I could also use `codecs.open` with `with`, as in `with codecs.open(cfg.log_file, encoding='utf-8') as input_file:`, and it works perfectly too. The interesting point is that in this case I don't even need the `u` in `.split(u' ')`; it also works with `fields = line.strip().split(' ')`. @Wiktor Stribiżew – pm1359 Mar 30 '16 at 13:40
string = '''
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ 
id:A2uhasan30hamwix160212145302428 
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ 
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ 
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ 
id:A2uhasan30hamwix160207145023750
'''
import re
reObj = re.compile(r'id:.*')
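# findall() on a compiled pattern takes (string, pos, endpos), not flags;
# passing re.DOTALL (16) there was read as pos=16 and skipped the first id.
ans = reObj.findall(string)

print(ans)

Output :

['id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ ', 
 'id:A2uhasan30hamwix160212145302428 ', 
 'id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ ', 
 'id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ ', 
 'id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ ', 
 'id:A2uhasan30hamwix160207145023750']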
Vincent Savard
Ash Ishh