UPDATED ISSUE
As requested, here is a reproducible example.
There are links to one sixth of my dataframe (a pandas.DataFrame object serialized with pickle) and to a Jupyter notebook with reproducible code. The notebook contains a sample of the dataframe on which the function applies correctly, and the bigger dataframe on which it does not.
Note that Dropbox will say a preview is not available, but the files are there; tell me if they are not.
ANCIENT ISSUE (it turned out that the problem does not come from pool.map())
Following up on this question, I applied this method to a sample of the dataframe to check that it does the right thing, and it does:
m = dfsample.Result.eq('Win')
s = m.shift().cumsum()
dfsample['gap_in_days'] = dfsample.groupby(['name', s])['Gap done'].cumsum() #"Expected Gap" in the linked topic
dfsample['nb_of_games'] = dfsample.assign(nb_of_games = 1).groupby('name')['nb_of_games'].apply(lambda x:x.shift().cumsum()).fillna(0)
dfsample['gap_in_numbers'] = dfsample.assign(nb = 1).groupby(['name',s])['nb'].cumsum()
It renders what I expect:
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
| Player | Result | Date | Gap done | gap_in_days | nb_of_games | gap_in_numbers |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
| K2000 | Lose | 2015-11-13 13:42:00 | 0.0 | 0.0 | 0 | -1 * |
| K2000 | Lose | 2016-03-23 16:40:00 | 131.0 | 131.0 | 1 | 1 |
| K2000 | Lose | 2016-05-16 19:17:00 | 54.0 | 185.0 | 2 | 2 |
| K2000 | Win | 2016-06-09 19:36:00 | 54.0 | 239.0 | 3 | 3 |
| K2000 | Win | 2016-06-30 14:05:00 | 54.0 | 54.0 | 4 | 1 |
| K2000 | Lose | 2016-07-29 16:20:00 | 29.0 | 29.0 | 5 | 2 |
| K2000 | Win | 2016-10-08 17:48:00 | 29.0 | 58.0 | 6 | 3 |
| Kssis | Lose | 2007-02-25 15:05:00 | 0.0 | 0.0 | 0 | 1 * |
| Kssis | Lose | 2007-04-25 6:07:00 | 59.0 | 59.0 | 1 | 1 |
| Kssis | Not-ranked | 2007-06-01 16:54:00 | 37.0 | 96.0 | 2 | 2 |
| Kssis | Lose | 2007-09-09 14:33:00 | 99.0 | 195.0 | 3 | 3 |
| Kssis | Lose | 2008-04-06 16:27:00 | 210.0 | 405.0 | 4 | 4 |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
To explain the data: Gap done is the number of days between two games. gap_in_days is the number of days until the player won a game. nb_of_games is obvious, I guess. gap_in_numbers is the number of games played until the player won.
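To make the grouping key concrete, here is a minimal sketch on an invented five-game history. The frame `demo` and its values are hypothetical. Unlike the code above, it fills the first shifted row with False through `shift(fill_value=False)`, which is my own choice and needs pandas 0.24 or later.

```python
import pandas as pd

# Hypothetical five-game history for one player (values invented for illustration).
demo = pd.DataFrame({
    "name": ["K2000"] * 5,
    "Result": ["Lose", "Win", "Lose", "Lose", "Win"],
})

# True on the rows where the player won.
won = demo["Result"].eq("Win")

# Shifting by one row makes the counter increase on the row *after* a win,
# so each win closes its own streak. The first row has no predecessor;
# it is filled with False here (assumption) so it stays in the first streak.
streak_id = won.shift(fill_value=False).cumsum()

# Running count of games inside each (player, streak) group.
demo["gap_in_numbers"] = (
    demo.assign(nb=1).groupby(["name", streak_id])["nb"].cumsum()
)

print(demo)
```

With this key, the count restarts at 1 on the game that follows each win.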
Note about the values marked with a *: I know these are odd results, but as I told Andy L., this is correctable: I just replace them with 0 when nb_of_games is 0. I show them anyway, because if you test the code you will see them and wonder about them.
Now, when I apply the same thing in a function with pool.map(function, iterable), it does not work, while applying the same function to the sample dataframe dfsample works fine.
The function is the following:
def gap_nb(df):
    s = mask_result(df)
    df['gap_in_numbers'] = df.assign(nb = 1).groupby(['name',s])['nb'].cumsum()
    return df
and the function mask_result is:
def mask_result(df):
    mask = df.Result.eq('P')
    s = mask.shift().cumsum()
    return s
Then I use it with pool.map(function, iterable) as:
dfs = pool.map(gap_nb, dfs)  # where dfs is a list of slices of a big dataframe
It simply fills the column gap_in_numbers with 1, as:
+----------------+
| gap_in_numbers |
+----------------+
| 0 |
| 1 |
| 1 |
| 1 |
| 1 |
| ... |
| 1 |
+----------------+
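For reference, a column of 1s is what a grouped cumsum produces when every row ends up in its own group, for instance when the mask is True on every row of a slice. The sketch below only illustrates that effect on invented data; it is not a confirmed diagnosis. The frame `check` is hypothetical, and `shift(fill_value=False)` (pandas 0.24 or later) replaces the plain `shift()` used above so that the first row keeps a valid key.

```python
import pandas as pd

# Hypothetical slice in which every row matches the value tested by the mask.
check = pd.DataFrame({
    "name": ["K2000"] * 4,
    "Result": ["P", "P", "P", "P"],
})

mask = check["Result"].eq("P")             # True on every row
s = mask.shift(fill_value=False).cumsum()  # 0, 1, 2, 3: a new key on every row

# Each (name, s) group holds exactly one row, so the running count never passes 1.
out = check.assign(nb=1).groupby(["name", s])["nb"].cumsum()

print(out.tolist())
```

If the real slices show this pattern, comparing `mask.sum()` with `len(df)` inside each slice would be a quick check.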
I tried a few workarounds, such as calling assign() in one function and applying cumsum() in another, but the result is the same.
So, can anyone tell me why?
Pandas version: 0.23.4 Python version: 3.7.4
Example data to play with (without the last column)
import io
import pandas as pd

# Date and time form a single field, and the column is named "Gap done",
# as in the table above.
s = '''Player,Result,Date,Gap done,gap_in_days,nb_of_games
K2000,Lose,2015-11-13 13:42:00,0.0,0.0,0
K2000,Lose,2016-03-23 16:40:00,131.0,131.0,1
K2000,Lose,2016-05-16 19:17:00,54.0,185.0,2
K2000,Win,2016-06-09 19:36:00,54.0,239.0,3
K2000,Win,2016-06-30 14:05:00,54.0,54.0,4
K2000,Lose,2016-07-29 16:20:00,29.0,29.0,5
K2000,Win,2016-10-08 17:48:00,29.0,58.0,6
Kssis,Lose,2007-02-25 15:05:00,0.0,0.0,0
Kssis,Lose,2007-04-25 6:07:00,59.0,59.0,1
Kssis,Not-ranked,2007-06-01 16:54:00,37.0,96.0,2
Kssis,Lose,2007-09-09 14:33:00,99.0,195.0,3
Kssis,Lose,2008-04-06 16:27:00,210.0,405.0,4'''
df = pd.read_csv(io.StringIO(s))
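To check whether the function behaves differently when it is mapped over slices, the serial and mapped results can be compared on a small frame. This is only a sketch under several assumptions: the data is invented (the example frame above has a Player column, while the functions group by name), `ThreadPool` stands in for the process pool so the snippet runs without a `__main__` guard, the mask tests 'Win' as in the sample code rather than 'P', and `shift(fill_value=False)` needs pandas 0.24 or later.

```python
import pandas as pd
from multiprocessing.pool import ThreadPool  # same map() API as multiprocessing.Pool


def mask_result(df):
    # Streak key: increases on the row after each win (first row filled with False).
    return df["Result"].eq("Win").shift(fill_value=False).cumsum()


def gap_nb(df):
    df = df.copy()  # work on a copy instead of mutating a slice of the parent frame
    s = mask_result(df)
    df["gap_in_numbers"] = df.assign(nb=1).groupby(["name", s])["nb"].cumsum()
    return df


# Invented data: two players, rows already sorted by player and date.
games = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "Result": ["Lose", "Win", "Lose", "Win", "Lose"],
})

# One slice per player, so no streak is cut in half by a slice boundary.
slices = [group for _, group in games.groupby("name", sort=False)]

with ThreadPool(2) as pool:
    parallel = pd.concat(pool.map(gap_nb, slices))

serial = gap_nb(games)

# True when the mapped result matches the serial one.
print(parallel["gap_in_numbers"].equals(serial["gap_in_numbers"]))
```

If the two results agree on small data but differ on the real slices, the difference more likely comes from the contents of the slices than from the mapping itself.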
Regarding the first question, it does not make sense to cumsum `nb_of_games` in `gap_nb`, because that simply gives a wrong result.
And I only have the problem with this function under multiprocessing. Strangely, other functions based on the same idea, such as the one for `gap_in_days`, work fine, which is why I did not include them in the post. – AvyWam Nov 24 '19 at 15:51