My input file (a.txt) contains data of the form:
{person1: [www.person1links1.com]}

{person2: [www.person2links1.com,www.person2links2.com]}...(36000 lines of such data)

I am interested in extracting data from each person's personal links, and my code is:

import urllib.request
import multiprocessing as mp

def get_bio(authr, urllist):
    author_data = []
    for each in urllist:
        try:
            html = urllib.request.urlopen(each).read()
            author_data.append(html)
        except:
            continue
    f = open(authr + '.txt', 'w+')
    for each in author_data:
        f.write(str(each))
        f.write('\n')
        f.write('********************************************')
        f.write('\n')
    f.close()

if __name__ == '__main__':
    q = mp.Queue()
    processes = []
    with open('a.txt') as f:
        for each in f:
            q.put(each)  # one dictionary per line
    while q.qsize() != 0:
        for authr, urls in q.get().items():
            p = mp.Process(target=get_bio, args=(authr, urls))
            processes.append(p)
            p.start()
    for proc in processes:
        proc.join()

I am getting the following error while running this code (I have tried raising ulimit, but I still hit the same error):

OSError: [Errno 24] Too many open files: 'personx.txt'
Traceback (most recent call last):
  File "perbio_mp.py", line 88, in <module>
    p.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
    parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files

Please point out where I am going wrong and how I can correct this. Thanks.

user7238503
  • Use a ProcessPoolExecutor (https://docs.python.org/3.5/library/concurrent.futures.html#processpoolexecutor) with a sane value for max_workers; 36k+ processes will almost never be the correct way to do something. That said, your code "should" work with a high enough limit for open files. – folkol Jul 15 '18 at 09:04 (see the sketch after these comments)
  • In addition to the other answers, perhaps [limit number of concurrent workers](https://stackoverflow.com/questions/20886565/using-multiprocessing-process-with-a-maximum-number-of-simultaneous-processes). – tomyl Jul 15 '18 at 09:32
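
A minimal sketch of the bounded-pool idea from folkol's comment, not part of the original post: it assumes the (author, urls) pairs have already been parsed out of a.txt into a hypothetical list called people, and max_workers=8 is an arbitrary illustrative value.

    # Sketch: cap the number of worker processes instead of starting ~36000 of them.
    from concurrent.futures import ProcessPoolExecutor

    if __name__ == '__main__':
        people = []  # hypothetical list of (author, [url, ...]) pairs parsed from a.txt
        with ProcessPoolExecutor(max_workers=8) as executor:
            # get_bio is the function from the question, assumed defined above
            futures = [executor.submit(get_bio, authr, urls) for authr, urls in people]
            for fut in futures:
                fut.result()  # re-raises any exception raised inside a worker

The executor keeps at most max_workers processes alive at a time, so the number of simultaneously open pipes and files stays bounded.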

3 Answers


Raising my ulimit from 1024 to 4096 fixed this for me.

Check the current limit with:

ulimit -n

For me it was 1024; after raising it to 4096, the code worked:

ulimit -n 4096
devil in the detail

Check the maximum number of file descriptors your OS allows. Some versions of macOS have a low default limit of 256 open files, for example El Capitan (10.11).

In any case, you can run the command:

ulimit -n 4096

before running your Python code.

If your code still breaks, check how many times the method get_bio(authr, urllist) is called. It could be that your loop opens more files than your OS can handle.
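
For illustration only (not part of the original answer), here is a small sketch of checking, and raising up to the hard limit, the per-process open-file limit from inside Python with the standard-library resource module; it only works on Unix-like systems, and the target of 4096 is an arbitrary choice.

    # Inspect and raise the open-file limit for the current process (Unix only).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print('soft limit:', soft, 'hard limit:', hard)

    # Raise the soft limit, never above the hard limit (raising the hard limit
    # itself usually requires root and is not attempted here).
    target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))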

Evhz

urlopen returns a response object that wraps an open file. Your code is not closing these files, hence the problem.

The response object is also a context manager, so instead of

    html = urllib.request.urlopen(each).read()
    author_data.append(html)

you can do

    with urllib.request.urlopen(each) as response:
        author_data.append(response.read())

to ensure that the file is closed after reading.

Also, as folkol observes in the comments, you should reduce the number of active processes to a sane amount as each one will open files at the OS level.
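
For illustration, and assuming the (author, urls) pairs have already been parsed from a.txt into a hypothetical list called people, a bounded pool could look like the sketch below; the worker count of 8 is an arbitrary choice, not something from the original answer.

    # Sketch: a fixed-size pool of workers instead of one process per person.
    import multiprocessing as mp

    if __name__ == '__main__':
        people = []  # hypothetical list of (author, [url, ...]) pairs parsed from a.txt
        with mp.Pool(processes=8) as pool:
            # get_bio is the question's function, assumed defined above;
            # each (authr, urllist) pair is unpacked as its arguments.
            pool.starmap(get_bio, people)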

snakecharmerb