
I am trying to build a big dataframe using a function that takes some arguments and returns a partial dataframe. I have a long list of documents to parse, extracting relevant information that will go into the big dataframe, and I am trying to use multiprocessing.Pool to make the process faster.

My code looks like this:

    import pandas as pd
    from multiprocessing import Pool, cpu_count

    import settings

    server_url = settings.SERVER_URL_UAT
    username   = settings.USERNAME
    password   = settings.PASSWORD

    def wrapped_visit_detail(args):

        global server_url
        global username
        global password

        # visit_detail returns a dataframe after consuming args
        return visit_detail(server_url, args, username, password)
    
    # Trying to pass a list of arguments to wrapped_visit_detail
    
    visits_url = [doc1, doc2, doc3, doc4]
    
    df = pd.DataFrame()
    pool = Pool(cpu_count())
    df = pd.concat( [ df,
                      pool.map( wrapped_visit_detail,
                                visits_url
                                )
                      ],
                    ignore_index = True
                    )

When I run this, I get this error:

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f2c88a43208>'. Reason: 'TypeError("can't pickle _thread._local objects",)'

EDIT

To illustrate my problem I created this simple figure

[figure: the current serial flow, documents parsed one after another and appended to the big dataframe]

This is painfully slow and not scalable at all

And I am looking to make the code not serial but rather as parallelized as possible

[figure: the desired parallel flow, documents parsed concurrently and their partial dataframes merged into the big dataframe]

Thank you all for your great comments so far. Yes, I am using shared variables as parameters to the function that pulls the files and extracts the individual dataframes; that does seem to be my issue indeed.

I suspect something is wrong in the way I call pool.map().

Any tip would be really welcome

Rad
  • Which OS are you using? I was never able to use pool.map() on Windows. – Ssayan Mar 03 '22 at 13:08
  • Hi Ssayan, I am using Ubuntu – Rad Mar 03 '22 at 13:14
  • I can't fully reproduce this as I don't have what you import, but `pool.map()` returns a list, so I would suggest you do this: `df = pd.concat(pool.map(wrapped_visit_detail, visits_url), ignore_index=True)`. When I use what you did I get an error, but not the same one, so I don't think that alone would fully solve your issue. According to [this thread](https://stackoverflow.com/questions/55708455/typeerror-cant-pickle-thread-local-objects-when-using-dask-on-pandas-datafra) your error may come from the use of global variables. – Ssayan Mar 03 '22 at 13:29
  • I also use `multiprocessing` in the context of processing `pandas.DataFrame` and never got an error like this. Your error (about pickling) indicates that the data transferred between two processes is not picklable. See https://docs.python.org/library/pickle.html to find out which data types are picklable. Maybe there is more than just a `DataFrame`? Or the cells in the `DF` contain unpicklable types. – buhtz Mar 03 '22 at 13:57

1 Answer


You may have realised on your own that there is actually no "sharing" possible between the Python interpreter and its sub-processes. There is no sharing, only absolutely independent and "disconnected" replicas of the __main__ interpreter's original, state-full copy (on Windows a complete, top-down copy), with all its internal variables, modules and whatsoever. Any change to these copies is never propagated back into the __main__ or anywhere else. Once more: if you try to compose the "big dataframe" from ~100k+ individually pre-produced "partial dataframes", you will get an awfully, if not unacceptably, low performance.

You lose the advantage of latency-masking among the partial producers, you head-bang into the RAM-allocation costs and potentially make memory-I/O many times worse, and you may even fall into the trap of ~10,000x slower physical/virtual-memory swapping once the in-RAM capacity ceases to be able to hold all the data. That is exactly what happens with an ex-post attempt to
pd.concat( _a_HUGE_in_RAM_store_first_LIST_of_ALL_100k_plus_partial_DFs_, ... )
which is (unless you want to test the patience of others) a no-go ANTI-pattern.


Q :
" Any tip ... "

A :
No matter how simple this one looks, it hits the Python process-to-process communication constraints, exactly as the unhandled-exception reporter says:

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f2c88a43208>'. Reason: 'TypeError("can't pickle _thread._local objects",)'

So, here we are. The Python interpreter has to use a "pickle"-like SER/DES step whenever it sends or receives a single piece of data from one process to another (typically the __main__ when sending launch parameters, or the worker processes when sending their remote result objects back to the __main__).

That is fine whenever the SER/DES serialisation is possible, which is not the case here.
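
To see the limitation in isolation (a minimal sketch, not the O/P's code): objects carrying thread-local state, such as the connection machinery hidden inside many HTTP client/session objects, simply refuse to be pickled:

    import pickle
    import threading

    local_state = threading.local()      # the kind of object hidden inside many client/session objects

    try:
        pickle.dumps(local_state)        # the very serialisation step multiprocessing performs under the hood
    except TypeError as exc:
        print(exc)                       # e.g. "cannot pickle '_thread._local' object"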

Options :

  • Best: avoid any and all object-passing (complex objects are prone to SER/DES failures)
  • If you insist on passing them, try a more capable SER/DES encoder; replacing the default pickle with `import dill as pickle` has saved me many times (yet not in every case, see above)
  • Design a better problem-solving strategy, one that does not rely on the ill-assumed "shared" use of a Pandas dataframe among several (fully independent) sub-processes, which cannot and do not access "the same" dataframe, but only their own locally-isolated replicas thereof (re-read more about sub-process independence and memory-space separation); a sketch of this option follows below
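
A minimal sketch of that third option, assuming `visit_detail()` can accept plain strings and that the only things ever crossing the process boundary are those strings and the returned DataFrames (any session/connection object gets created inside the worker, never passed around); `visit_detail` and `doc1` … `doc4` are the placeholders from the question:

    import pandas as pd
    from multiprocessing import Pool, cpu_count

    import settings

    def wrapped_visit_detail(doc_url):
        # Only a plain string enters this worker and only a plain DataFrame leaves it,
        # so everything crossing the process boundary is trivially picklable.
        # Any session/connection object must be created *inside* visit_detail(), per call.
        return visit_detail(settings.SERVER_URL_UAT,
                            doc_url,
                            settings.USERNAME,
                            settings.PASSWORD)

    if __name__ == '__main__':
        visits_url = [doc1, doc2, doc3, doc4]            # placeholders from the question
        with Pool(cpu_count()) as pool:
            partial_dfs = pool.map(wrapped_visit_detail, visits_url)
        df = pd.concat(partial_dfs, ignore_index=True)   # one single concat, at the very end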

Nota bene:
School-book SLOCs are nice in school-book-sized examples, yet they turn into awfully expensive performance ANTI-patterns in real-world use cases, and even more so in production-grade code execution.

Tip:

If performance is The Target,
have each worker produce an independent, file-based result and join those files afterwards (a sketch of this follows below). Composing the "big dataframe" from in-RAM "partial dataframes" represents a mix of sins that causes many performance ANTI-pattern problems; the failure to SER/DES-pickle is just one of them, a small one, visible to the naked eye.
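
A minimal sketch of that file-based strategy, assuming pyarrow (or fastparquet) is available for Parquet I/O and that a hypothetical `parse_document()` stands in for the O/P's per-document extraction step:

    import os
    import pandas as pd
    from multiprocessing import Pool, cpu_count

    OUT_DIR = "partials"                             # hypothetical output directory

    def process_one(task):
        # Each worker writes its own file and returns only a small string (the path),
        # so nothing heavy and nothing unpicklable ever crosses the process boundary.
        idx, doc_url = task
        df_part = parse_document(doc_url)            # hypothetical: the O/P's extraction step
        out_path = os.path.join(OUT_DIR, f"part_{idx:06d}.parquet")
        df_part.to_parquet(out_path, index=False)
        return out_path

    if __name__ == '__main__':
        os.makedirs(OUT_DIR, exist_ok=True)
        visits_url = [doc1, doc2, doc3, doc4]        # placeholders from the question
        with Pool(cpu_count()) as pool:
            paths = pool.map(process_one, list(enumerate(visits_url)))
        # Join the partial results exactly once, after all workers have finished.
        big_df = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)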

Decisions ( & the costs associated with making them ) are in all cases on you.

You might like some further reads on how to boost performance.

user3666197
  • I'm the author of `dill`. With regard to your second "option": I'd suggest trying `multiprocess`, a fork of `multiprocessing` that uses `dill`, and can utilize `dill.settings` for serialization variants. – Mike McKerns Mar 03 '22 at 16:56
  • Sure, I know @MikeMcKerns, we've discussed `dill`-related issues here in the past & I keep recommending `dill` as a smart SER/DES replacement, IIRC since ~2008 (time flies...). Here, the O/P's problem is to re-factor the overall strategy of processing if a performance boost is to be achieved (doable). A move from `multiprocessing` to `multiprocess` or `joblib` does not by itself achieve a reasonable end-to-end processing performance (HPC-grade processing speeds being far, indeed far, ahead). The O/P is struggling instead with the first visible show-stopper; RAM swaps come next. – user3666197 Mar 03 '22 at 18:16
  • Yeah, I just mentioned it more due to the fact the OP is attempting to use `multiprocessing` and you mentioned object-sharing across processes. Were it me, and I had an array (or `pandas.Series`) to share, I might instead use a shared memory array from `multiprocess` -- which can be quite fast. You have to use `Process` and `Lock` objects and the like, but it's worth it. I'm sure it's been done with a `pandas.DataFrame`, but I've never tried it (though I've used a 2D `numpy` array). There's also the alternative of using a `map` (or similar) built on top of `mpi4py`. Ship near-zero data. – Mike McKerns Mar 04 '22 at 00:31
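
For completeness, a minimal sketch of the drop-in replacement discussed in these comments, assuming the third-party `multiprocess` package (the `dill`-backed fork mentioned above) is installed; whether `dill` can actually serialise a given session/client object still has to be verified case by case:

    # pip install multiprocess          (third-party fork of multiprocessing, backed by dill)
    import pandas as pd
    from multiprocess import Pool, cpu_count         # mirrors the multiprocessing API

    def wrapped_visit_detail(doc_url):
        # visit_detail, server_url, username, password as defined in the question
        return visit_detail(server_url, doc_url, username, password)

    if __name__ == '__main__':
        visits_url = [doc1, doc2, doc3, doc4]        # placeholders from the question
        with Pool(cpu_count()) as pool:
            df = pd.concat(pool.map(wrapped_visit_detail, visits_url), ignore_index=True)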