Remove unicode numbers in each line in the file

Question

My file (outputfile5.txt) contains the following (every element in the file is Unicode):

5അവന്‍ --> 1രാമന്‍   
6അവള്‍ക്ക് --> 2സീതയെ   
10അവള്‍ --> 6അവള്‍ക്ക് --> 2സീതയെ   
11അത്‌ --> 7പൂവ്‌   
14അവര്‍ --> 2സീതയെ , 1രാമന്‍   
19അവിടെ --> 16കോട്ടയത്ത്‌   
21അവര്‍ക്ക്‌ --> 2സീതയെ , 1രാമന്‍   
26അവിടെ --> 19അവിടെ --> 16കോട്ടയത്ത്‌   
32അവന്‍ --> 28രാമന്‍   
44അവനെ --> 40ലക്ഷ്മണന്‍   
45അവള്‍ക്ക്‌ --> 41സീതയെ   
48ഈ --> 49വഴ   
51അവര്‍ --> 41സീതയെ , 40ലക്ഷ്മണന്‍   
60അവിടെ --> 55കോട്ടയം

My required output should be saved in another file (result.txt), like this:

അവന്‍ --> രാമന്‍   
അവള്‍ക്ക് --> സീതയെ   
അവള്‍ --> അവള്‍ക്ക് --> സീതയെ   
അത്‌ --> പൂവ്‌   
അവര്‍ --> സീതയെ , രാമന്‍   
അവിടെ --> കോട്ടയത്ത്‌   
അവര്‍ക്ക്‌ --> സീതയെ , രാമന്‍   
അവിടെ --> അവിടെ --> കോട്ടയത്ത്‌   
അവന്‍ --> രാമന്‍   
അവനെ --> ലക്ഷ്മണന്‍   
അവള്‍ക്ക്‌ --> സീതയെ   
ഈ --> വഴ   
അവര്‍ --> സീതയെ , ലക്ഷ്മണന്‍   
അവിടെ --> കോട്ടയം

My code is:

fq = codecs.open('outputfile5.txt', encoding='utf-8')
lines = fq.readlines()
fq.close()
fa = codecs.open('result.txt', 'w')
for line in lines:
    line1=[]
    line1=line.split()
    for i in line1:
        if u'-->' not in i or u',' not in i:
            s = re.match('([0-9]+)', i).group(1)
            word=i[len(s):]
            fa.write(word.encode('UTF-8'))
        else:
            fa.write(i.encode('UTF-8'))
fa.close()

While running the code it shows the following error:

s = re.match('([0-9]+)', i).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

How can I solve this?

Because `re.match('([0-9]+)', i)` isn't matching – Ravi Dhoriya ツ May 06 '14 at 08:04
I changed that also but it didn't work. – user3251664 May 06 '14 at 08:19
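
For reference, the traceback seems to come from the condition `u'-->' not in i or u',' not in i`: it is true for any token that does not contain both `-->` and `,`, so the separator tokens also reach `re.match`, which returns `None` when a token has no leading digits. A minimal sketch of a guarded version follows. The helper names `strip_leading_number` and `clean_line` are hypothetical, and the sketch assumes Python 3 strings and that re-joining tokens with single spaces (dropping trailing whitespace) is acceptable.

```python
import re


def strip_leading_number(token):
    """Return the token without its leading ASCII digits, if it has any."""
    match = re.match(r'[0-9]+', token)
    if match is None:
        # Tokens such as '-->' or ',' have no leading number: keep them as-is.
        return token
    return token[match.end():]


def clean_line(line):
    """Strip the leading number from every whitespace-separated token."""
    return ' '.join(strip_leading_number(token) for token in line.split())
```

Checking the result of `re.match` before using it avoids the `AttributeError` without having to list every separator explicitly.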

Tom Fenech · Accepted Answer · 2014-05-06T08:30:19.737

2

I'm not sure if I'm missing something obvious here but does this do what you want?

with open('outputfile5.txt') as input, open('result.txt', 'w') as output:
    for line in input:
        output.write(''.join([c for c in line if not c.isdigit()]))

result.txt:

അവന് --> രാമന്   
അവള്ക്ക് --> സീതയെ   
അവള് --> അവള്ക്ക് --> സീതയെ   
അത് --> പൂവ്   
അവര് --> സീതയെ , രാമന്   
അവിടെ --> കോട്ടയത്ത്   
അവര്ക്ക് --> സീതയെ , രാമന്   
അവിടെ --> അവിടെ --> കോട്ടയത്ത്   
അവന് --> രാമന്   
അവനെ --> ലക്ഷ്മണന്   
അവള്ക്ക് --> സീതയെ   
ഈ --> വഴ   
അവര് --> സീതയെ , ലക്ഷ്മണന്   
അവിടെ --> കോട്ടയം

edited May 06 '14 at 08:30

answered May 06 '14 at 08:12

Tom Fenech


You don't need the inner list in `join`, `''.join(c for c in line if not c.isdigit())` – Burhan Khalid May 06 '14 at 08:43
@Burhan I know, (in fact I've said the same to others :) but in the case of `join`, [list comprehensions are more efficient than generator expressions](http://stackoverflow.com/a/9061024/2088135). – Tom Fenech May 06 '14 at 08:47
@Burhan, the question is, to whom does that apply? :) – Tom Fenech May 06 '14 at 08:51
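
The list-versus-generator point in the comments above can be checked locally. The following is only a sketch of how one might measure it: the function names and the sample line are made up for illustration, and the timings depend on the interpreter and machine, so no result is claimed here.

```python
import timeit

# Hypothetical sample input: a short line repeated to make timings measurable.
SAMPLE = '14a --> 2b , 1c\n' * 100


def with_list(text):
    """Filter out digits using a list comprehension inside join."""
    return ''.join([c for c in text if not c.isdigit()])


def with_generator(text):
    """Filter out digits using a generator expression inside join."""
    return ''.join(c for c in text if not c.isdigit())


if __name__ == '__main__':
    for func in (with_list, with_generator):
        seconds = timeit.timeit(lambda: func(SAMPLE), number=1000)
        print(func.__name__, seconds)
```

Both functions return the same string; only the running time may differ.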

kmario23 · Answer 2 · 2014-05-06T08:40:09.217

0

You can simply do this

import re
with open('outputfile5.txt') as inpf, open('result.txt', 'w') as outf:
    for line in inpf:
        # Remove every run of digits from the line
        outf.write(re.sub(r'\d+', '', line))
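
One thing to be aware of if this is run on Python 3 with text strings: `\d` there matches any Unicode decimal digit, not only `0-9`, so Malayalam digits would be removed too. The sketch below illustrates the difference, assuming Python 3 `str` patterns; the pattern names and the sample string are made up for illustration.

```python
import re

ASCII_DIGITS = re.compile(r'[0-9]+')  # matches only the characters 0-9
ANY_DIGITS = re.compile(r'\d+')       # on Python 3 str: any Unicode decimal digit

# Hypothetical sample: an ASCII 5 and a MALAYALAM DIGIT FIVE (U+0D6B).
sample = '5a \u0d6bb'

ascii_only = ASCII_DIGITS.sub('', sample)  # leaves the Malayalam digit in place
any_digit = ANY_DIGITS.sub('', sample)     # removes both digits
```

If only the ASCII numbers should go, `[0-9]` (or `\d` together with the `re.ASCII` flag) is the safer choice.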

edited May 06 '14 at 08:40

answered May 06 '14 at 08:22

kmario23


Alfe · Answer 3 · answered May 06 '14 at 08:23

0

How about a straight-forward

import codecs
import re

with codecs.open('outputfile5.txt', encoding='utf-8') as input:
  with codecs.open('result.txt', 'w', encoding='utf-8') as output:
    for line in input:
      # '+' rather than '*' so the pattern never matches the empty string
      output.write(re.sub(r'[0-9]+', '', line))

solution?
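
The same idea can be wrapped in a small function that runs on both Python 2 and Python 3 by using `io.open`. This is only a sketch: the function name `strip_digits_from_file` and the constant `DIGITS` are hypothetical, and it assumes both files are UTF-8 encoded and that every ASCII digit in the file should be removed.

```python
import io
import re

DIGITS = re.compile(r'[0-9]+')  # runs of ASCII digits only


def strip_digits_from_file(src_path, dst_path):
    """Copy src_path to dst_path, removing ASCII digits from every line."""
    with io.open(src_path, encoding='utf-8') as src:
        with io.open(dst_path, 'w', encoding='utf-8') as dst:
            for line in src:
                dst.write(DIGITS.sub(u'', line))
```

For the files in the question it would be called as `strip_digits_from_file('outputfile5.txt', 'result.txt')`.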

answered May 06 '14 at 08:23

Alfe

