
I have a Python module that preprocesses and tokenizes a dataset I want to use. The dataset is a 144M-line text file that I read into memory, split into pieces, shuffle, and then write to new files. Previously, writing was done by the following function:

import io

def write_lines(filename, lines):
    with io.open(filename, 'w', encoding='utf-8') as fout:
        # Builds one huge string in memory and writes it in a single call.
        fout.write('\n'.join(lines))

When I try this on the 144M-line dataset, I get IOError: [Errno 22]. However, the exact same code runs without issues on a 6M-line dataset. Before the dataset is sent to this module, it is run through a filtering service that ensures only characters matching the pattern [\x00-\x7f] are in the file, as described in this post.

I am running Python 2.7 in an Anaconda environment. Some of the code I am using came from an open source project that performs some complicated string processing logic that does not work on Python 3 no matter what I have tried, so switching to Python 3 is not an option (if that would even help).

Is there any way a larger dataset could be causing this error? I would have thought that the only thing that could go wrong is a memory error but Errno 22 does not seem to have anything to do with memory.

jbird
  • Can the filter step be done separately and before shuffling? – Alex Reynolds Dec 21 '16 at 21:55
  • Are you on OSX by any chance? If you are, look at this [related SO question](http://stackoverflow.com/questions/11662960/ioerror-errno-22-invalid-argument-when-reading-writing-large-bytestring) – jkr Dec 21 '16 at 21:55
  • @Jakub Yes I am running the code on an iMac. – jbird Dec 21 '16 at 21:58
  • @AlexReynolds No, the filtering is done in a separate Java module (along with a lot of other logic) that depends on pre-existing Java classes – jbird Dec 21 '16 at 21:59

1 Answer


You don't need to join all your lines into one big string. You are probably ending up with a string that is too large to write in a single call. Try this:

def write_lines(filename, lines):
    with io.open(filename, 'w', encoding='utf-8') as fout:
        for line in lines:
            fout.write(line + '\n')

And look through this question.
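
If one write call per line turns out to be slow for 144M lines, a middle ground is to join the lines in bounded batches, so that no single write gets anywhere near the size of the whole dataset. This is only a minimal sketch: the name write_lines_batched and the batch_size default are hypothetical, not from the original code, and it assumes the lines are unicode or ASCII-only strings, which the filtering step should guarantee.

```python
import io
from itertools import islice


def write_lines_batched(filename, lines, batch_size=100000):
    """Write lines to filename, joining at most batch_size lines per write call.

    batch_size is an illustrative default, not a tuned value.
    """
    it = iter(lines)
    with io.open(filename, 'w', encoding='utf-8') as fout:
        while True:
            # Pull the next batch of lines without copying the whole input.
            batch = list(islice(it, batch_size))
            if not batch:
                break
            # Every line gets its own trailing newline, so batches concatenate cleanly.
            fout.write(u''.join(line + u'\n' for line in batch))
```

Like the per-line version above, this ends the file with a trailing newline, which the original '\n'.join(lines) did not.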

Yevhen Kuzmovych