I am having a problem with pandas.concat.
I am concatenating a DataFrame with three Series along axis=1. The DataFrame and two of the Series are concatenated as expected. The third Series, however, is attached to the bottom of the new DataFrame as extra rows instead of being added as a column.
Here is my minimal working example. To get the output below, run it on the Titanic Kaggle dataset.
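import pandas as pd

# Assumed location of the Kaggle Titanic training file; adjust to your local copy.
train_df = pd.read_csv('train.csv')

# INCLUDED ONLY SO MY CODE WILL RUN ON YOUR MACHINE. IGNORE.
def bin_dump(data, increment):
    # Map a number to a string label for the bin of width `increment` it falls in.
    if data <= increment:
        return f'0 - {increment}'
    if data % increment == 0:
        return f'{data - increment} - {data}'
    else:
        m = data % increment
        a = data - m
        b = data + (increment - m)
        return f'{a} - {b}'

# INCLUDED SO MY CODE WILL RUN ON YOUR MACHINE. IGNORE.
train_df['AgeGroup'] = train_df.apply(lambda x: bin_dump(x.Age, 3), axis=1)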
# THE PROBLEM IS ACTUALLY IN THIS METHOD:
def plot_dists(X, Y, input_df, percent_what):
    totals = input_df[X].value_counts()
    totals.name = 'totals'
    df = pd.Series(totals.index).str.extract(r'([0-9]+)').astype('int64')
    df.columns = ['index']
    values = pd.Series(totals.index, name=X)
    percentages = []
    for group, total in zip(totals.index, totals):
        x = input_df.loc[(input_df[X] == group) & (input_df[Y] == 1), Y].sum()
        percent = 1 - x / total
        percentages.append(percent)
    percentages = pd.Series(percentages, name='Percentages')
    # THE PROBLEM IS HERE:
    df = pd.concat([df, values, totals, percentages], axis=1).set_index('index').sort_index(axis=0)
    return df
The output looks like this:
       AgeGroup       totals  Percentages
index
0.0    0 - 3          NaN     0.333333
3.0    3.0 - 6.0      NaN     0.235294
6.0    6.0 - 9.0      NaN     0.666667
9.0    9.0 - 12.0     NaN     0.714286
12.0   12.0 - 15.0    NaN     0.357143
15.0   15.0 - 18.0    NaN     0.625000
18.0   18.0 - 21.0    NaN     0.738462
21.0   21.0 - 24.0    NaN     0.57534
.      .              .       .
.      .              .       .
.      .              .       .
NaN    NaN            11.0    NaN
NaN    NaN            15.0    NaN
NaN    NaN            9.0     NaN
NaN    NaN            6.0     NaN
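So the 'totals' values are being appended as extra rows at the bottom, with NaN in every other column, instead of filling the 'totals' column next to their age groups.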
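A smaller toy example seems to reproduce the same behaviour, which makes me suspect that index alignment is involved: value_counts() returns a Series indexed by the group labels, while the other pieces have the default 0..n-1 index. The data below is made up for illustration (it is not the Titanic set), and the reset_index(drop=True) call is only my guess at a workaround, not something I have confirmed is the right fix.

```python
import pandas as pd

# Made-up stand-in for the 'AgeGroup' column (hypothetical data).
groups = pd.Series(
    ['0 - 3', '3 - 6', '0 - 3', '6 - 9', '3 - 6', '0 - 3'],
    name='AgeGroup',
)

totals = groups.value_counts()                      # indexed by the group labels
totals.name = 'totals'
values = pd.Series(totals.index, name='AgeGroup')   # default 0..n-1 index

# The two indexes share no labels, so concat(axis=1) keeps the union of both:
# 3 rows from 'values' plus 3 rows from 'totals', each padded with NaN.
misaligned = pd.concat([values, totals], axis=1)

# Giving 'totals' the same 0..n-1 index lets the rows line up.
aligned = pd.concat([values, totals.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

With this toy data, misaligned has six rows and aligned has three, which matches the shape of the problem in my real output.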
In addition to trying to fix this concat/append issue, I'd welcome any suggestions on how to optimize my code. This is my first go at building my own tool for visualizing data (I cut out the plotting part because it's not really part of the question).