Suppose I have a very large text file with many lines, and I want to reverse each line. I don't care about the final order of the lines. The input file contains Cyrillic characters. I use multiprocessing
to spread the work across several cores.
I wrote this program:
# task.py
import multiprocessing as mp

POOL_NUMBER = 2
lock_read = mp.Lock()
lock_write = mp.Lock()

fi = open('input.txt', 'r')
fo = open('output.txt', 'w')


def handle(line):
    # In the future I want to do
    # some more complicated operations over the line
    return line.strip()[::-1]  # Reversing


def target():
    while True:
        try:
            with lock_read:
                line = next(fi)
        except StopIteration:
            break
        line = handle(line)
        with lock_write:
            print(line, file=fo)


pool = [mp.Process(target=target) for _ in range(POOL_NUMBER)]
for p in pool:
    p.start()
for p in pool:
    p.join()

fi.close()
fo.close()
This program fails with the following error:
Process Process-2:
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "task.py", line 22, in target
    line = next(fi)
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 0: invalid start byte
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "task.py", line 22, in target
    line = next(fi)
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
On the other hand, everything works fine if I set POOL_NUMBER = 1. But that defeats the purpose, since I want to improve overall performance.
Why does this error happen, and how can I fix it?
I use Python 3.5.2.
I generated the data using this script:
# gen_file.py
from random import randint

LENGTH = 100
SIZE = 100000


def gen_word(length):
    return ''.join(
        chr(randint(ord('а'), ord('я')))
        for _ in range(length)
    )


if __name__ == "__main__":
    with open('input.txt', 'w') as f:
        for _ in range(SIZE):
            print(gen_word(LENGTH), file=f)
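One direction that might sidestep the problem, offered as a sketch rather than a confirmed fix: the worker processes in task.py all inherit the same open input file, so they likely share one underlying file position while each keeps its own read buffer and decoder state. That would let a worker start decoding in the middle of a two-byte Cyrillic character. The variant below lets only the parent process read and write, and hands the lines to a `multiprocessing.Pool`. The helper name `reverse_file`, the explicit `encoding='utf-8'`, and `chunksize=1000` are my assumptions, not part of the original program.

```python
import multiprocessing as mp


def handle(line):
    # Same operation as in task.py: strip whitespace and reverse the line.
    return line.strip()[::-1]


def reverse_file(src, dst, workers=2, chunksize=1000):
    # Hypothetical helper: only the parent process touches the files,
    # so no worker ever reads from a shared file handle.
    # encoding='utf-8' is an assumption based on the traceback.
    with open(src, 'r', encoding='utf-8') as fi, \
            open(dst, 'w', encoding='utf-8') as fo, \
            mp.Pool(workers) as pool:
        # imap_unordered yields results as they finish; the output order
        # is not preserved, which is acceptable for this task.
        for result in pool.imap_unordered(handle, fi, chunksize=chunksize):
            print(result, file=fo)
```

It would be called as `reverse_file('input.txt', 'output.txt')` from inside an `if __name__ == "__main__":` guard. Whether this is actually faster than a single process depends on how expensive `handle` becomes, since every line has to be pickled and sent to a worker.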