I have a Python module that is responsible for preprocessing/tokenizing a dataset I want to use. The dataset is a 144M-line text file that I read into memory, split into pieces, shuffle, and then write to new files. Previously, writing was done by the following function:
    import io

    def write_lines(filename, lines):
        # Join all lines into one string and write it in a single call.
        with io.open(filename, 'w', encoding='utf-8') as fout:
            fout.write('\n'.join(lines))
When trying to do this on the 144M-line dataset, I get IOError: [Errno 22]. However, the exact same code runs without issues on a 6M-line dataset. Before the dataset is sent to this module, it is run through a filtering service that ensures only characters matching the pattern [\x00-\x7f] remain in the file, as described in this post.
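(For reference, my understanding is that the filter boils down to something like the following; filter_ascii is my placeholder name, not the service's actual code:)

    import re

    # Matches any character outside the ASCII range \x00-\x7f.
    NON_ASCII = re.compile(r'[^\x00-\x7f]')

    def filter_ascii(text):
        # Strip everything the pattern [\x00-\x7f] would not allow.
        return NON_ASCII.sub('', text)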
I am running Python 2.7 in an Anaconda environment. Some of the code I am using comes from an open-source project whose complicated string-processing logic I have not been able to get working on Python 3 no matter what I have tried, so switching to Python 3 is not an option (if that would even help).
Could the larger dataset itself be causing this error? I would have thought the only thing that could go wrong at this scale is a MemoryError, but Errno 22 does not seem to have anything to do with memory.
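In case it is relevant, this is the chunked variant I am considering as a workaround, writing line by line instead of building one giant string (just a sketch; I have not verified that it avoids the error on the full 144M-line file, and write_lines_chunked is my own name for it):

    import io

    def write_lines_chunked(filename, lines):
        # Avoid materializing one multi-GB string and issuing a single
        # huge write() call; write each line separately instead.
        with io.open(filename, 'w', encoding='utf-8') as fout:
            for i, line in enumerate(lines):
                if i:
                    fout.write(u'\n')
                fout.write(line)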