
I am working on a project for daily fantasy sports.

I have a dataframe of possible lineups (6 columns, one for each player in a lineup).

As part of my process, I generate a possible fantasy point value for all players.

Next, I want to total the points scored for a lineup in my lineups dataframe by referencing the fantasy points dataframe.

For reference:

  • Lineups Dataframe: columns = F1, F2, F3, F4, F5, F6 where each column is a player's name + '_' + their player id
  • Fantasy Points Dataframe: columns = Player + ID, Fantasy Points

I go column by column for the 6 players to get the 6 fantasy points values:

for col in ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']:
    lineups = lineups.join(
        sim_data[['Name_SlateID', 'Points']]
        .set_index('Name_SlateID')
        .rename(columns={'Points': f'{col}_points'}),
        how='left', on=col)
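For reference, here is a miniature, runnable version of that loop (player names and point values are made up), with the joined column renamed per slot so the resulting columns are named F1_points, F2_points, and so on:

```python
import pandas as pd

# Hypothetical miniature frames standing in for the real data.
lineups = pd.DataFrame({'F1': ['Alice_1', 'Bob_2'],
                        'F2': ['Bob_2', 'Cara_3']})
sim_data = pd.DataFrame({'Name_SlateID': ['Alice_1', 'Bob_2', 'Cara_3'],
                         'Points': [10.0, 20.0, 30.0]})

# One lookup table, renamed per slot so each join adds {col}_points.
points = sim_data[['Name_SlateID', 'Points']].set_index('Name_SlateID')
for col in ['F1', 'F2']:
    lineups = lineups.join(points.rename(columns={'Points': f'{col}_points'}),
                           how='left', on=col)
```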

Then, in what I thought would be the simplest part, I try to sum them up and I get Segmentation Fault: 11

sum_columns = ['F1_points', 'F2_points', 'F3_points', 'F4_points', 'F5_points', 'F6_points']

lineups = reduce_memory_usage(lineups)

lineups[f'sim_{i}_points'] = lineups[sum_columns].sum(axis=1, skipna=True)

reduce_memory_usage comes from this article: https://towardsdatascience.com/6-pandas-mistakes-that-silently-tell-you-are-a-rookie-b566a252e60d

I have reduced the dataframe's memory footprint by 50% before running this line by choosing appropriate dtypes, I have tried pd.eval() instead, and I have tried summing the columns one by one in a for loop; nothing ever seems to work.
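One more experiment worth noting for anyone hitting the same thing: upcast the float16 point columns to float32 just for the sum, since half-precision reductions are a comparatively rarely exercised path in pandas/NumPy. A sketch with synthetic data (whether this sidesteps the crash depends on the pandas/NumPy build):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real float16 point columns.
rng = np.random.default_rng(0)
sum_columns = [f'F{i}_points' for i in range(1, 7)]
df = pd.DataFrame(rng.uniform(0, 60, size=(1000, 6)).astype(np.float16),
                  columns=sum_columns)

# Upcast before the row-wise reduction instead of summing float16 directly.
totals = df[sum_columns].astype('float32').sum(axis=1, skipna=True)
```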

Any help is greatly appreciated!

Edit: Specs: OS - MacOS Monterey 12.2.1, python - 3.8.8, pandas - 1.4.1

Here are the details of my lineups dataframe right before the line causing the error:

Data columns (total 27 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   F1                  107056 non-null  object 
 1   F2                  107056 non-null  object 
 2   F3                  107056 non-null  object 
 3   F4                  107056 non-null  object 
 4   F5                  107056 non-null  object 
 5   F6                  107056 non-null  object 
 6   F1_own              107056 non-null  float16
 7   F1_salary           107056 non-null  int16  
 8   F2_own              107056 non-null  float16
 9   F2_salary           107056 non-null  int16  
 10  F3_own              107056 non-null  float16
 11  F3_salary           107056 non-null  int16  
 12  F4_own              107056 non-null  float16
 13  F4_salary           107056 non-null  int16  
 14  F5_own              107056 non-null  float16
 15  F5_salary           107056 non-null  int16  
 16  F6_own              107056 non-null  float16
 17  F6_salary           107056 non-null  int16  
 18  total_salary        107056 non-null  int32  
 19  dupes               107056 non-null  float32
 20  over_600_frequency  107056 non-null  int8   
 21  F1_points           107056 non-null  float16
 22  F2_points           107056 non-null  float16
 23  F3_points           107056 non-null  float16
 24  F4_points           107056 non-null  float16
 25  F5_points           107056 non-null  float16
 26  F6_points           107056 non-null  float16
dtypes: float16(12), float32(1), int16(6), int32(1), int8(1), object(6)
memory usage: 10.3+ MB
afed15
  • What is the shape or your dataframe? and provide df.info() too. – Scott Boston Mar 26 '22 at 16:32
  • @ScottBoston, just added in an edit. – afed15 Mar 26 '22 at 16:39
  • Hrm.. this doesn't look to be a big dataframe at all. – Scott Boston Mar 26 '22 at 16:43
  • How does it work when you leave floats as floats, and not float32 or float16 (i.e., skip the memory reduction part, which I'm not sure is worth it anyway)? Since the segmentation faults seems to occur when summing float16 values. – 9769953 Mar 26 '22 at 16:47
  • You should add the Pandas version, and probably the Python and OS version, to your question. If you can make it reproducible, e.g. by setting the values to something like `np.linspace(1, 10, 107056, dtype=np.float16)` and possibly leaving out some columns, that would be even better. – 9769953 Mar 26 '22 at 16:51
  • @ScottBoston this is my confusion as well – afed15 Mar 26 '22 at 16:55
  • @9769953 OS - MacOS Monterey 12.2.1, python - 3.8.8, pandas - 1.4.1. About to step out but will try to provide stuff to reproduce when I return. I can get the code to run without problems sometimes, but it is not consistent and always eventually gives the Segmentation Fault – afed15 Mar 26 '22 at 17:00
  • @9769953 also, I did run it without the memory reduction part and got the same error so I tried adding it in case it would help at all. speed increased but still got the same issue – afed15 Mar 26 '22 at 17:10
  • Can you post the data some where to download? – Scott Boston Mar 26 '22 at 17:33

1 Answer


Segmentation fault: 11 means the process crashed on an invalid memory access, which with a workload like this usually indicates you're running out of memory. As a backup plan, there are cloud solutions (e.g. AWS, GCP, Azure) that will give you more than enough memory; Colab is free and might be enough for your needs.

As far as fixing the underlying problem, it might be impossible to use pandas here if your dataset is too big. I would also store sim_data[['Name_SlateID', 'Points']] once so it doesn't get recomputed on every join, and delete dataframes you have already joined to free their memory. Does any of that help?
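A sketch of both suggestions together, with hypothetical frame contents; `del` plus `gc.collect()` is the standard way to release a frame you no longer need:

```python
import gc
import pandas as pd

# Hypothetical stand-ins for the real frames.
sim_data = pd.DataFrame({'Name_SlateID': ['A_1', 'B_2'],
                         'Points': [5.0, 7.0]})
lineups = pd.DataFrame({'F1': ['A_1', 'B_2'],
                        'F2': ['B_2', 'A_1']})

# Build the lookup once, outside the loop, so it isn't re-sliced per column.
points_lookup = sim_data.set_index('Name_SlateID')['Points']
for col in ['F1', 'F2']:
    lineups[f'{col}_points'] = lineups[col].map(points_lookup)

# Drop frames that are no longer needed and prompt the collector.
del sim_data
gc.collect()
```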

  • Thank you for the help, I will try this later and report back! However, I did try both Colab and Jupyter Notebooks and had the same issues – afed15 Mar 26 '22 at 17:45