I am having a little problem with pandas.concat.

Namely, I am concatenating a dataframe with 3 series. The dataframe and 2 of the series are concatenating as expected. One series, however, is being attached to the bottom of my new dataframe instead of as a column.

Here is my minimal working example. To get the output below, run it on the titanic Kaggle dataset.


import pandas as pd

#INCLUDED ONLY SO MY CODE WILL RUN ON YOUR MACHINE. IGNORE.
def bin_dump(data, increment):
    if data <= increment:
        return f'0 - {increment}'
    if data % increment == 0:
        return f'{data - increment} - {data}'
    else: 
        m = data % increment
        a = data - m
        b = data + (increment - m)
        return f'{a} - {b}'

#INCLUDED SO MY CODE WILL RUN ON YOUR MACHINE. IGNORE
train_df['AgeGroup'] = train_df.apply(lambda x: bin_dump(x.Age, 3), axis=1)

# THE PROBLEM IS ACTUALLY IN THIS METHOD:
def plot_dists(X, Y, input_df, percent_what):


    totals = input_df[X].value_counts()
    totals.name = 'totals'

    df = pd.Series(totals.index).str.extract(r'([0-9]+)').astype('int64')
    df.columns=['index']

    values = pd.Series(totals.index, name=X)

    percentages = []
    for group, total in zip(totals.index, totals):
        x = input_df.loc[(input_df[X] == group)&(input_df[Y] == 1), Y].sum()
        percent = 1 - x/total
        percentages.append(percent)

    percentages = pd.Series(percentages, name='Percentages')

    # THE PROBLEM IS HERE:
    df = pd.concat([df, values, totals, percentages], axis=1).set_index('index').sort_index(axis=0)

    return df

The output looks like this:

      AgeGroup        totals    Percentages
index           
0.0   0 - 3           NaN       0.333333
3.0   3.0 - 6.0       NaN       0.235294
6.0   6.0 - 9.0       NaN       0.666667
9.0   9.0 - 12.0      NaN       0.714286
12.0  12.0 - 15.0     NaN       0.357143
15.0  15.0 - 18.0     NaN       0.625000
18.0  18.0 - 21.0     NaN       0.738462
21.0  21.0 - 24.0     NaN       0.57534
.         .            .          .
.         .            .          . 
.         .            .          .  
NaN       NaN         11.0       NaN
NaN       NaN         15.0       NaN
NaN       NaN         9.0        NaN
NaN       NaN         6.0        NaN

So the 'totals' values are being appended as extra rows at the bottom instead of lining up as a column.

In addition to trying to fix this concat/append issue, I'd welcome any suggestions on how to optimize my code. This is my first go at building my own tool for visualizing data (I cut out the plotting part because it's not really part of the question).

rocksNwaves
  • It seems that you have a problem with your indices in your `totals` column. – Sebastián V. Romero Mar 28 '20 at 02:48
  • @SebastiánV.Romero I would like the indices to be ignored, I only want to concatenate the data. I saw that there is an 'ignore_index' argument. I will try that. – rocksNwaves Mar 28 '20 at 02:49
  • Comment what you get when you have it. Also, it would be great to have your data to play with and a minimal working example of your code, if you can :D – Sebastián V. Romero Mar 28 '20 at 02:56
  • @SebastiánV.Romero I will cut out the comments? but for my code to work you will need another function I wrote that dumps continuous data into bins with 'ranges'. My code will get a lot less "minimal" lol. I'll work on it now. ignoring index gave me an error. At least the code above runs :/ – rocksNwaves Mar 28 '20 at 03:01
  • I understand, but we can play with your inputs in this function without worrying about how you obtained them, do you agree? :D If not, we can play with some dummy values. What is the error that you get? Try to use merge instead of concat. (Sorry, but I don't have a computer to try it on right now.) – Sebastián V. Romero Mar 28 '20 at 03:08
  • @SebastiánV.Romero Code updated to run. The error I get is a key error in the plotting portion of my method, which is not present in order to keep things "minimal": `KeyError: "None of ['index'] are in the columns"` – rocksNwaves Mar 28 '20 at 03:12
  • This whole thing is a mess but if you don't care about indices just set the series index to the dataframe index and then add the series as a column via assignment. – CJR Mar 28 '20 at 03:16
  • This whole example is a mess because it's long and hard for me to follow and you only want help with a small part of it. I'm not sure what's important and what isn't. As a result I'm not sure if my suggestion even fixes your problem. There's not enough here to judge if the project itself is going well or not though. – CJR Mar 28 '20 at 03:41
  • Sorry again, but it's impossible to give an answer with what you put in your question. Your code isn't working because `train_df` is not defined, and without data (post a link to the titanic Kaggle dataset that you're using) we can't reproduce your error. `@rocksNwaves` – Sebastián V. Romero Mar 28 '20 at 16:33

1 Answer

Check this out. Did you try to change from concat to merge?
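
For what it's worth, the likely cause is that `pd.concat(..., axis=1)` lines rows up by index label, not by position. In the question, `totals` comes from `value_counts()`, so its index holds the group labels, while the other pieces carry a default 0, 1, 2, ... index. Here is a minimal sketch of that behaviour and two possible fixes. The `groups` series is made-up toy data standing in for the binned Titanic column, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df['AgeGroup'].
groups = pd.Series(
    ['0 - 3', '3 - 6', '3 - 6', '6 - 9', '6 - 9', '6 - 9'],
    name='AgeGroup',
)

totals = groups.value_counts()
totals.name = 'totals'                              # index: the group labels
labels = pd.Series(totals.index, name='AgeGroup')   # index: 0, 1, 2

# The two indexes share no labels, so concat keeps the union of both
# and fills the gaps with NaN: the totals end up on rows of their own.
misaligned = pd.concat([labels, totals], axis=1)

# Fix 1: drop the label index so both pieces align by position.
aligned = pd.concat([labels, totals.reset_index(drop=True)], axis=1)

# Fix 2: merge on the group label instead of relying on row order.
totals_frame = totals.rename_axis('AgeGroup').reset_index()
merged = labels.to_frame().merge(totals_frame, on='AgeGroup')

print(misaligned)
print(aligned)
print(merged)
```

The positional fix only works because `labels` was built from `totals.index` in the same order. The merge variant does not depend on row order, which is probably the safer choice if the pieces are built separately.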

Henrique Branco