
UPDATED ISSUE

As asked, I am providing a reproducible example.
There are links to access 1/6 of my dataframe (a pandas.DataFrame serialized with Pickle) and a Jupyter notebook with reproducible code, containing both a sample of the dataframe on which the function applies correctly and the bigger dataframe on which it does not.

Note: Dropbox may say that a preview is not available, but the files can still be downloaded; tell me if not.
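
For reference, a minimal sketch of loading the serialized dataframe once it has been downloaded (the file name 'df_part.pkl' below is only a placeholder for whatever the Dropbox file is actually called):

import pandas as pd

df_part = pd.read_pickle('df_part.pkl')  # placeholder name for the downloaded pickle
df_part.info()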

ORIGINAL ISSUE (it turned out the issue does not come from pool.map())

In connection with this problem, I first apply the method to a sample of the dataframe to check that it does the right thing, which it does:

m = dfsample.Result.eq('Win')
s = m.shift().cumsum()
dfsample['gap_in_days'] = dfsample.groupby(['name', s])['Gap done'].cumsum() #"Expected Gap" in the linked topic
dfsample['nb_of_games'] = dfsample.assign(nb_of_games = 1).groupby('name')['nb_of_games'].apply(lambda x:x.shift().cumsum()).fillna(0)
dfsample['gap_in_numbers'] = dfsample.assign(nb = 1).groupby(['name',s])['nb'].cumsum()

It renders what I expect:

+-----------+------------+---------------------+----------+-------------+-------------+----------------+
|    Player |   Result   |        Date         | Gap done | gap_in_days | nb_of_games | gap_in_numbers |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
| K2000     | Lose       | 2015-11-13 13:42:00 |      0.0 |         0.0 |           0 | -1 *           |
| K2000     | Lose       | 2016-03-23 16:40:00 |    131.0 |       131.0 |           1 | 1              |
| K2000     | Lose       | 2016-05-16 19:17:00 |     54.0 |       185.0 |           2 | 2              |
| K2000     | Win        | 2016-06-09 19:36:00 |     54.0 |       239.0 |           3 | 3              |
| K2000     | Win        | 2016-06-30 14:05:00 |     54.0 |        54.0 |           4 | 1              |
| K2000     | Lose       | 2016-07-29 16:20:00 |     29.0 |        29.0 |           5 | 2              |
| K2000     | Win        | 2016-10-08 17:48:00 |     29.0 |        58.0 |           6 | 3              |
| Kssis     | Lose       | 2007-02-25 15:05:00 |      0.0 |         0.0 |           0 | 1 *            |
| Kssis     | Lose       | 2007-04-25 6:07:00  |     59.0 |        59.0 |           1 | 1              |
| Kssis     | Not-ranked | 2007-06-01 16:54:00 |     37.0 |        96.0 |           2 | 2              |
| Kssis     | Lose       | 2007-09-09 14:33:00 |     99.0 |       195.0 |           3 | 3              |
| Kssis     | Lose       | 2008-04-06 16:27:00 |    210.0 |       405.0 |           4 | 4              |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+

To explain the data: Gap done is the number of days between two consecutive games. gap_in_days is the number of days elapsed until the player won a game. nb_of_games is self-explanatory, I guess. gap_in_numbers is the number of games played until the player won.
Note about the values marked with a *: I know these are odd results, but as I told Andy L., this is easy to correct; I simply replace them with 0 when nb_of_games is 0. I mention them because if you test the code you will see them and wonder about them.
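
To make the grouping key concrete, here is a minimal sketch (on a toy Result column, not my data) of what m.shift().cumsum() produces: the label increases on the row that follows each 'Win', so grouping by it restarts every cumsum at the first game after a win. The NaN label on the first row may be related to the starred values above, since rows whose group key is NaN are dropped by groupby.

import pandas as pd

results = pd.Series(['Lose', 'Lose', 'Win', 'Win', 'Lose', 'Win'])
m = results.eq('Win')
s = m.shift().cumsum()
# s is NaN on the first row (nothing to shift into it) and then gets a new
# label on the row right after every 'Win'
print(pd.DataFrame({'Result': results, 's': s}))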

Now, when I apply the same thing inside a function with pool.map(function, iterable) it does not work, whereas applying the same function directly to the sample dataframe dfsample works perfectly fine.

The function is the following:

def gap_nb(df):
    s = mask_result(df)
    df['gap_in_numbers'] = df.assign(nb = 1).groupby(['name',s])['nb'].cumsum()
    return df

and the function mask_result is:

def mask_result(df):
    mask = df.Result.eq('P')
    s = mask.shift().cumsum()
    return s
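
Called directly, without multiprocessing, these two functions behave exactly like the inline code on dfsample shown above, for example:

out = gap_nb(dfsample.copy())  # direct call: this is the case that works
print(out[['Result', 'gap_in_numbers']])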

Then I use it with pool.map(function, iterable) as

dfs = pool.map(gap_nb , dfs) #where dfs is a list of slices of a big dataframe

it simply fills the column gap_in_numbers with 1s, as:

+----------------+
| gap_in_numbers |
+----------------+
|              0 |
|              1 |
|              1 |
|              1 |
|              1 |
|            ... |
|              1 |
+----------------+
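
For context, this is roughly how the multiprocessing call is set up; how the big dataframe gets sliced into dfs is not shown here, so the np.array_split below is only an assumption for illustration:

from multiprocessing import Pool

import numpy as np
import pandas as pd

if __name__ == '__main__':
    big_df = pd.read_pickle('big_df.pkl')  # placeholder for the full dataframe
    dfs = np.array_split(big_df, 6)        # assumed slicing; the real slicing code is not shown
    with Pool() as pool:
        dfs = pool.map(gap_nb, dfs)
    result = pd.concat(dfs)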

I tried a few workarounds, such as calling assign() in one function and applying cumsum() in another, but it returns the same thing.

So, can anyone tell me why?


Pandas version: 0.23.4
Python version: 3.7.4


Example data to play with (without the last column)

import io
import pandas as pd

s = '''Player,Result,Date,Gap done,gap_in_days,nb_of_games
K2000,Lose,2015-11-13 13:42:00,0.0,0.0,0
K2000,Lose,2016-03-23 16:40:00,131.0,131.0,1
K2000,Lose,2016-05-16 19:17:00,54.0,185.0,2
K2000,Win,2016-06-09 19:36:00,54.0,239.0,3
K2000,Win,2016-06-30 14:05:00,54.0,54.0,4
K2000,Lose,2016-07-29 16:20:00,29.0,29.0,5
K2000,Win,2016-10-08 17:48:00,29.0,58.0,6
Kssis,Lose,2007-02-25 15:05:00,0.0,0.0,0
Kssis,Lose,2007-04-25 6:07:00,59.0,59.0,1
Kssis,Not-ranked,2007-06-01 16:54:00,37.0,96.0,2
Kssis,Lose,2007-09-09 14:33:00,99.0,195.0,3
Kssis,Lose,2008-04-06 16:27:00,210.0,405.0,4'''

df = pd.read_csv(io.StringIO(s), parse_dates=['Date'])
  • In `gap_nb` should `df.assign(nb = 1).groupby(['name',s])['nb'].cumsum()` be `df.assign(nb = 1).groupby(['Player',s])['nb_of_games'].cumsum()` to match your example df? Also in `mask_result` should `df.Result.eq('P')` be `df.Result.eq('Win')`?? – wwii Nov 24 '19 at 15:41
  • Do the two functions work as you expect when you pass your example dataframe to them (without using multiprocessing)? – wwii Nov 24 '19 at 15:45
  • @wwii you're right on one point, I changed `df.Result.eq('P')` to `df.Result.eq('Win')`; the examples are above all illustrative, the real data are much bigger and not in English.
    As for the first question, it is not useful to cumsum `nb_of_games` in `gap_nb` because it simply gives a wrong result.
    And this function is the only one I have a problem with under multiprocessing. Strangely, others based on the same idea, such as `gap_in_days`, work fine, which is why I did not include them in the question.
    – AvyWam Nov 24 '19 at 15:51
  • Using your example DataFrame and `with Pool(5) as p: print(p.map(gap_nb, [df]))` I get the same result as passing the DataFrame to `gap_nb` without using `multiprocessing`. I get good results with `p.map(gap_nb, [df.iloc[:7],df.iloc[7:]])` ... So it seems you haven't provided sufficient information - please read [mcve]. we need to be able to test and recreate your problem. – wwii Nov 24 '19 at 15:53
  • @wwii, maybe the problem occurs when the df is split, because that is actually the case. And it is split correctly. – AvyWam Nov 24 '19 at 15:57
  • Certainly, if you split it across `groupby` boundaries you will get inaccurate results. – wwii Nov 24 '19 at 15:58
  • Without a sufficient [mcve] it's not possible to recreate your problem. If it is due to the way you are splitting up the DataFrame perhaps try - `pool.map(gap_nb, [thing[1] for thing in df.groupby('Player')])` – wwii Nov 24 '19 at 16:04
  • @wwii I am going to see how to make a reproducible example. – AvyWam Nov 24 '19 at 16:20
  • @wwii I provided a reproducible dataset and code. – AvyWam Nov 28 '19 at 15:29
  • I provided what I thought was an answer, is it what you were looking for? – wwii Nov 28 '19 at 20:34
  • @wwii it did not correspond to the results I expected; I built a list of dataframes, one per unique `Player`. Producing this list is terribly costly (actually that is the longest operation in the whole thing, and a very long one), but it gives the result I expect, and then I concatenate with `pd.concat(the_high_cost_list_of_dfs)`. – AvyWam Nov 29 '19 at 16:13

0 Answers