IOError writing very large file

Question

I have a Python module that does some preprocessing and tokenizing on a dataset I want to use. The dataset is a 144M-line text file that I read into memory, split into pieces, shuffle, and then write to new files. Previously, the writing was done by the following function:

def write_lines(filename, lines):
    with io.open(filename, 'w', encoding='utf-8') as fout:
        fout.write('\n'.join(lines))

When I try this on the 144M-line dataset, I get IOError: [Errno 22]. However, the exact same code runs without issues on a 6M-line dataset. Before the dataset is sent to this module, it is run through a filtering service that ensures only characters matching the pattern [\x00-\x7f] are in the file, as described in this post.

I am running Python 2.7 in an Anaconda environment. Some of the code I am using came from an open source project that performs some complicated string processing logic that does not work on Python 3 no matter what I have tried, so switching to Python 3 is not an option (if that would even help).

Is there any way a larger dataset could be causing this error? I would have thought that the only thing that could go wrong is a memory error, but Errno 22 does not seem to have anything to do with memory.

Can the filter step be done separately and before shuffling? — Alex Reynolds, Dec 21 '16 at 21:55
Are you on OSX by any chance? If you are, look at this [related SO question](http://stackoverflow.com/questions/11662960/ioerror-errno-22-invalid-argument-when-reading-writing-large-bytestring) — jkr, Dec 21 '16 at 21:55
@AlexReynolds No, the filtering is done in a separate Java module (along with a lot of other logic) that depends on pre-existing Java classes — jbird, Dec 21 '16 at 21:59

score 2 · Accepted Answer · edited May 23 '17 at 10:30

You don't need to join all your lines into one big string. The joined string is probably too large to be written in a single call. Write the lines one at a time instead:

def write_lines(filename, lines):
    with io.open(filename, 'w', encoding='utf-8') as fout:
        for line in lines:
            fout.write(line + '\n')

And look through this question.
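
If one write call per line turns out to be slow on a file this size, a middle ground is to join a bounded number of lines per write. This is only a sketch: the name `write_lines_in_batches` and the `batch_size` parameter are made up for illustration, and the default value is an arbitrary guess, not a tuned number. Like the loop above, it puts a newline after every line, including the last one.

```python
import io
import itertools


def write_lines_in_batches(filename, lines, batch_size=100000):
    """Write lines to filename, joining at most batch_size lines per write.

    batch_size is a hypothetical tuning knob, not something from the question.
    """
    iterator = iter(lines)
    with io.open(filename, 'w', encoding='utf-8') as fout:
        while True:
            # Pull the next batch_size lines (fewer at the end of the input).
            batch = list(itertools.islice(iterator, batch_size))
            if not batch:
                break
            # Each write sees a bounded string instead of the whole dataset.
            fout.write(u'\n'.join(batch) + u'\n')
```

Because it only iterates over `lines`, this also works when `lines` is a generator rather than a list.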

answered Dec 21 '16 at 21:55

Yevhen Kuzmovych

Or possibly more succinctly: `fout.writelines(line + '\n' for line in lines)` – SethMMorton Dec 21 '16 at 22:08
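
Expanded into a full function, the `writelines` suggestion might look like the sketch below. It assumes `lines` holds unicode strings, as `io.open` in text mode requires on Python 2.

```python
import io


def write_lines(filename, lines):
    with io.open(filename, 'w', encoding='utf-8') as fout:
        # writelines() adds no separators itself, so each newline is appended here.
        # The generator expression avoids building a second full list in memory.
        fout.writelines(line + u'\n' for line in lines)
```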
