
I use a split-apply-merge workflow with pandas, where the 'apply' step returns a DataFrame. When the DataFrame I call `groupby` on has been sorted first, simply returning a DataFrame from `apply` raises `ValueError: cannot reindex from a duplicate axis`. It works properly, however, when I return `pd.concat([df])` instead of just `df`. If I don't sort the DataFrame, both ways of returning the result work correctly. I expect sorting is doing something to the index, but I don't understand what. Can someone please explain?

import pandas as pd
import numpy as np


def fill_out_ids(df, filling_function, sort=False, sort_col='sort_col',
                 group_by='group_col', to_fill=['id1', 'id2']):

    df = df.copy()
    df.set_index(group_by, inplace=True)
    if sort:
        df.sort_values(by=sort_col, inplace=True)
    g = df.groupby(df.index, sort=False, group_keys=False)
    df = g.apply(filling_function, to_fill)
    df.reset_index(inplace=True)
    return df


def _fill_ids_concat(df, to_fill):
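    # Forward-fill, then back-fill, so every row in the group gets an id.
    # (.ffill()/.bfill() replace the deprecated fillna(method=...) form.)
    df[to_fill] = df[to_fill].ffill()
    df[to_fill] = df[to_fill].bfill()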
    return pd.concat([df])


def _fill_ids_plain(df, to_fill):
    # Same fill logic as above; only the return value differs.
    df[to_fill] = df[to_fill].ffill()
    df[to_fill] = df[to_fill].bfill()
    return df


def test_fill_out_ids():
    input_df = pd.DataFrame(
        [
            ['a',       None,       1.0,    1],
            ['a',       None,       1.0,    3],
            ['a',       'name1',    np.nan, 2],

            ['b',       None,       2.0,    3],
            ['b',       'name1',    np.nan, 2],
            ['b',       'name2',    np.nan, 1],
        ],
        columns=['group_col', 'id1', 'id2', 'sort_col']
    )

    # this works
    fill_out_ids(input_df, _fill_ids_plain, sort=False)

    # this raises: ValueError: cannot reindex from a duplicate axis
    fill_out_ids(input_df, _fill_ids_plain, sort=True)

    # this works
    fill_out_ids(input_df, _fill_ids_concat, sort=True)

    # this works
    fill_out_ids(input_df, _fill_ids_concat, sort=False)


if __name__ == "__main__":
    test_fill_out_ids()
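
One way to see what `sort_values` does to the index is to compare the index labels before and after sorting. The sketch below rebuilds the test data from above outside the helper functions. `kind='mergesort'` is an assumption added here so that ties in `sort_col` are ordered deterministically; the original code uses the default sort.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['a', None, 1.0, 1],
        ['a', None, 1.0, 3],
        ['a', 'name1', np.nan, 2],
        ['b', None, 2.0, 3],
        ['b', 'name1', np.nan, 2],
        ['b', 'name2', np.nan, 1],
    ],
    columns=['group_col', 'id1', 'id2', 'sort_col'],
).set_index('group_col')

# kind='mergesort' (a stable sort) is an assumption added for this sketch,
# so that rows with equal sort_col keep their original relative order.
sorted_df = df.sort_values(by='sort_col', kind='mergesort')

unsorted_labels = list(df.index)
sorted_labels = list(sorted_df.index)

print(unsorted_labels)   # each group's rows are contiguous
print(sorted_labels)     # group labels are interleaved
print(df.index.is_unique, sorted_df.index.is_unique)
print(df.index.is_monotonic_increasing,
      sorted_df.index.is_monotonic_increasing)
```

The index contains duplicate labels in both cases. What sorting changes is that the labels become interleaved, so concatenating the per-group results group by group no longer reproduces the row order of the sorted frame.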
Fryderyq
  • Could you include a [mcve] for just the operation that is puzzling you? – wwii Apr 30 '18 at 18:59
  • wwii thanks for your feedback. I edited the question, and while doing that I tracked down the source of my confusion a little: sorting the DataFrame seems to be what matters here. – Fryderyq May 01 '18 at 11:56
  • .. If you get the groups individually and call the plain version, they work ... `b = g.get_group('b'); _fill_ids_plain(b, to_fill)` .. but it doesn't like applying it to the groupby object – wwii May 01 '18 at 15:04
  • Totally by accident. – Fryderyq May 01 '18 at 15:19
  • Do you just want to know why this happens, or are you looking for other solutions that work when the DataFrame is sorted? – wwii May 01 '18 at 19:18
  • I want to understand why it works with `pd.concat()` and what the `sort_values` does to the index. Looking at the DataFrame in the debugger I don't see any obvious explanations. – Fryderyq May 01 '18 at 20:38
  • https://stackoverflow.com/a/34018827/2823755 – wwii May 01 '18 at 22:36
  • that's already something :) – Fryderyq May 02 '18 at 23:11
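
The error message itself can be reproduced without `groupby`, which may help narrow down the cause. A possible explanation, not verified against the pandas source for the version used in the question, is that when `apply` returns each group with its index unchanged, pandas tries to restore the original row order by reindexing the concatenated result against the original index, while a freshly built object such as `pd.concat([df])` skips that step. The sketch below only demonstrates the reindexing behaviour; the frame and the helper `reindex_raises` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Index with duplicate labels, similar to the result of set_index('group_col').
grouped_order = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a', 'a', 'b', 'b'])

# Same rows with the labels interleaved, as after sorting by another column.
interleaved_order = grouped_order.iloc[[0, 2, 1, 3]]


def reindex_raises(frame, target_index):
    """Return True if frame.reindex(target_index) raises ValueError."""
    try:
        frame.reindex(target_index)
    except ValueError:
        return True
    return False


# Reindexing duplicated labels to a different ordering is ambiguous:
# pandas cannot tell which 'a' row should go where.
raises_on_reordered = reindex_raises(grouped_order, interleaved_order.index)

# Reindexing to an identical index requires no label lookup.
raises_on_identical = reindex_raises(grouped_order, grouped_order.index)

print(raises_on_reordered, raises_on_identical)
```

If this is indeed the mechanism, it would match the observations in the question: without sorting, the concatenated groups already carry the same index as the input, so the reindex has nothing to resolve, whereas after sorting the two orderings differ and the duplicate labels make the lookup ambiguous.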
