
I've traditionally been reading in files with:

file = open(fullpath, "r")
allrecords = file.read()            # reads the entire file into memory at once
delimited = allrecords.split('\n')
for record in delimited[1:]:        # skips the first line
    record_split = record.split(',')

and

with open(os.path.join(txtdatapath, pathfilename), "r") as data:
    datalines = (line.rstrip('\r\n') for line in data)
    for record in datalines:
        split_line = record.split(',')
        if len(split_line) > 1:
            pass  # process split_line here

But it seems that when I process these files with multiprocessing I get a MemoryError. How can I best read in files line by line, when the text file I'm reading needs to be split on '\n'?
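
For reference, a minimal line-by-line rewrite of the first snippet (same names assumed) that never holds the whole file in memory:

# a sketch: iterate over the open file instead of read().split('\n'),
# so only one line is held in memory at a time
with open(fullpath, "r") as f:
    next(f)  # skip the first line, as delimited[1:] does
    for record in f:
        record_split = record.rstrip('\r\n').split(',')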

Here is the multiprocessing code:

from multiprocessing import Pool
import time

pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
while not op_list.ready():
    print("Number of files left to process: {}".format(op_list._number_left))
    time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()
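
For reference, a rough sketch of the same call using imap_unordered, which streams results back one at a time instead of accumulating them all in the pool's result-handler thread (where the MemoryError in the log below is raised):

# a sketch, with the question's names assumed: results are consumed as
# they arrive, so they never pile up in the result queue
from multiprocessing import Pool

pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = []
for result in pool.imap_unordered(PPD_star, varg, chunksize=1):
    op_list.append(result)  # or fold each result into a running total here
pool.close()
pool.join()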

Here is the error log:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError

I'm trying to install pathos, as Mike has kindly suggested, but I'm running into issues. Here is my install command:

pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre

But here are the error messages that I get:

Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master

Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft

    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow).
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)

Storing debug log for failure in C:\Users\xxx\pip\pip.log

I'm installing on Windows 7, 64-bit. In the end I managed to install with easy_install.

But now I have a failure, as I cannot open that many files:

Finished reading in Exposures...
Reading Samples from:  C:\XXX\XXX\XXX\
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multipr
ocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\\xx.csv'

Currently, using the multiprocessing library, I am passing parameters and dictionaries into my function, opening a mapped file, and then outputting a dictionary. Here is an example of how I currently do it; what would be the smart way to do this with pathos?

def PP_star(args_flat):
    return PP(*args_flat)

def PP(pathfilename, txtdatapath, my_dict):
    # ... open and process the file here, building com_dict ...
    return com_dict

fixed_args = (targetdirectorytxt, my_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PP_star, list(varg), chunksize=1)

How can I perform the same function with pathos.multiprocessing?

disruptive
  • Please fix your indentation. And show us the multiprocessing code you use. –  Feb 19 '15 at 16:21
  • Just iterate over the open file; it is the default behavior to split at line endings. Also, it looks like you are parsing a CSV file; have you seen the `csv` module? – Paulo Scardine Feb 19 '15 at 16:22
  • Can you post the code relating to the multiprocessing/multithreading (which is it?) – Tom Dalton Feb 19 '15 at 16:22
  • @PauloScardine I tried the CSV module, but I get the same issue. – disruptive Feb 20 '15 at 11:54
  • Note that this task seems to be I/O bound, while multiprocessing is indicated for CPU-bound tasks. You can read simultaneously from several files at the same time without multithreading; in fact it is probably better to simply open several files and iterate over them in the same thread. – Paulo Scardine Feb 20 '15 at 14:26
  • See my comment on my answer below… but if you are limited by the number of files you can open… that's an easy fix. – Mike McKerns Feb 20 '15 at 17:54
  • You can use `file_obj.readline()` to read the next line. – Dominik Schmidt Feb 20 '15 at 20:55

3 Answers


Just iterate over the lines instead of reading the whole file, like this:

with open(os.path.join(txtdatapath, pathfilename), "r") as data:
    for dataline in data:
        split_line = dataline.rstrip('\r\n').split(',')
        if len(split_line) > 1:
            pass  # process split_line here
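
Since the records are comma-separated, the `csv` module suggested in the comments streams rows the same way; a minimal sketch reusing the question's names:

import csv
import os

# csv.reader pulls one row at a time and handles the splitting itself
with open(os.path.join(txtdatapath, pathfilename), "rb") as data:
    for split_line in csv.reader(data):
        if len(split_line) > 1:
            pass  # process split_line here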
gefei

Let's say we have file1.txt:

hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123

file2.txt:

1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye

and so on, through file5.txt:

1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye

I'd suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.

>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>> 
>>> def rstrip(line):
...     return line.rstrip()
... 
# get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']

However, if you want to check how many files you have left to finish, you might want to use an "iterated" map (imap) instead of an "asynchronous" map (amap). See this post for details: Python multiprocessing - tracking the process of pool.map operation
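
For example, a rough sketch of the imap variant, assuming the same session as above (and that imap takes the same arguments as amap here):

>>> # reopen the files, since the earlier generator has been consumed
>>> files = (open(name, 'r') for name in fnames)
>>> # results now arrive one file at a time, in order
>>> for i, lines in enumerate(thpool.imap(mppool.map, [rstrip]*len(fnames), files)):
...     print "finished file {} of {}".format(i + 1, len(fnames))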

Get pathos here: https://github.com/uqfoundation
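
And if you are limited by the number of files you can open at once (the Errno 24 in the question), a rough sketch of processing them in batches with the blocking map, as suggested in the comments below; the batch size of 500 is an arbitrary assumption:

>>> data = []
>>> for i in range(0, len(fnames), 500):
...     # open only one batch at a time, to stay under the open-file limit
...     batch = [open(name, 'r') for name in fnames[i:i+500]]
...     data.extend(thpool.map(mppool.map, [rstrip]*len(batch), batch))
...     for f in batch:
...         f.close()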

Mike McKerns
  • Thanks, but I want to do more than just strip lines in parallel. Will this help me avoid the memory issues, i.e. should I simply use this code to generate the list data from each file and then process with multiprocessing? – disruptive Feb 20 '15 at 10:32
  • I did try installing pathos, but having issues with pip install – disruptive Feb 20 '15 at 10:34
  • Even though `pathos` is almost ten years old, its latest release is a bit stale… and predates the PEP that standardized version numbers. So that's the issue with the `pip` install. A new `pip`-installable release is imminent, but in the meantime you can get it from https://github.com/uqfoundation, where the install is relatively painless as it's pure python. – Mike McKerns Feb 20 '15 at 13:01
  • If you want to do more than just split the lines, then you just need to modify the `rstrip` function I've provided. You could replace my `rstrip` function with your data processing function. The point is that this code reads lines one-at-a-time from several files in parallel… how you augment it after that is up to you. If the collective data in the files is super big, then you can't just read the data as I've done… you should augment `rstrip` to process the data *or* apply a reducer (like `sum` or `reduce_my_data` or whatever) inside the `map` call. – Mike McKerns Feb 20 '15 at 13:06
  • BTW, if you do get the code from github, you can use `pip` or `easy_install`, you just have to use the `pre` flag with `pip`, as the version is technically a prerelease. – Mike McKerns Feb 20 '15 at 13:29
  • I tried using pip, but I get the following issue: No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0) – disruptive Feb 20 '15 at 15:08
  • I have this installed, but it won't handle the 2,000 files that I need to process. – disruptive Feb 20 '15 at 17:30
  • If you are limited by the number of files you can open at once, then you can make a simple modification to split up `fnames` and `files` into, say, 500 or 100 files at a time. You could put my above code in a for loop or a blocking `map` function. – Mike McKerns Feb 20 '15 at 17:55
  • @Navonod how did you solve the "No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)" issue? Replying to my own question: pyre==0.8.2.0-pathos had to be installed from http://danse.cacr.caltech.edu/packages/dev_danse_us/ as pip install failed. Leaving this here for readers' convenience. – Moonwalker Apr 01 '15 at 17:51
  • The `pyre` dependency has been removed, and the install is much simpler now. – Mike McKerns Aug 11 '15 at 10:49

Try this:

for line in file('file.txt'):
    print line.rstrip()

Of course, instead of printing them, you could also add them to a list or perform some other operation on them.
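
For example, to collect the stripped lines into a list, using open() as the comment below recommends:

with open('file.txt') as f:
    lines = [line.rstrip() for line in f]  # one pass, line by line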

Dominik Schmidt
  • According to [Python's documentation](https://docs.python.org/2/library/functions.html#file): "When opening a file, it’s preferable to use open() instead of invoking this constructor directly. file is more suited to type testing (for example, writing isinstance(f, file))." – erik-e Feb 19 '15 at 17:38