1

For the following dataframe: every group of c should have three values of b. The second value of a should be the average of the first and third value of a.

What is the easiest way to insert the "missing" row with a=48, b=42, c=4 between index = 0 and index = 1?

df_x = pd.DataFrame({"a": [47, 49, 55, 54, 53, 24, 27, 30], "b": [41, 43, 51, 52, 53, 41, 42, 43], "c": [4, 4, 5, 5, 5, 4, 4, 4]})
df_x
Out[14]: 
    a   b  c
0  47  41  4
1  49  43  4
2  55  51  5
3  54  52  5
4  53  53  5
5  24  41  4
6  27  42  4
7  30  43  4

If I use groupby('c').transform(my_func) or groupby('c').apply(my_func), I face the situation that the first call to my function my_func is done twice.

  • How did you determine that `42` is missing and not anything else? – cs95 Jan 15 '18 at 21:03
  • Hi @cᴏʟᴅsᴘᴇᴇᴅ, this is just what the functional requirement for this dataframe foresees. I have to correct missing data to process it further (and this example is a simplification of the real setup). –  Jan 15 '18 at 21:06
  • I'm afraid that you've oversimplified your question to the point that any answer provided may not generalise to your case, so I'd recommend broadening it a bit. – cs95 Jan 15 '18 at 21:07
  • The first call to `my_func` is done twice indeed. Pandas does this in order to determine whether it could take a faster code path. This is documented [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html#flexible-apply) (see the warning box at the end of the subsection). – Hristo Iliev Jan 15 '18 at 21:11
  • Thanks for the edit. Is it just one insertion, or do you have to do many like this? – cs95 Jan 15 '18 at 21:21
  • @cᴏʟᴅsᴘᴇᴇᴅ: the DataFrame is much longer. But the structure of the other rows is exactly like the rows shown. –  Jan 15 '18 at 21:26
  • Possible duplicate of [Insert a row to pandas dataframe](https://stackoverflow.com/questions/24284342/insert-a-row-to-pandas-dataframe) – Tai Jan 15 '18 at 23:20

2 Answers2

1

pandas's insert method only works for columns. We can use numpy.insert. Cons: this will create a new dataset. This should serve as an alternative to pd.concat or pd.append or pd.merge.

df_x = pd.DataFrame({"a": [47, 49, 55, 54, 53, 24, 27, 30], "b": [41, 43, 51, 52, 53, 41, 42, 43], "c": [4, 4, 5, 5, 5, 4, 4, 4]})

pd.DataFrame(np.insert(df_x.values, 1, values=[48, 42, 4], axis=0))


    0   1   2
0   47  41  4
1   48  42  4
2   49  43  4
3   55  51  5
4   54  52  5
5   53  53  5
6   24  41  4
7   27  42  4
8   30  43  4

In np.insert(df_x.values, 1, values=[48, 42, 4], axis=0), 1 tells the function the place/index you want to place the new values.

Tai
  • 7,684
  • 3
  • 29
  • 49
  • the idea was to have a generic solution - whenever a row in the middle of the group is missing, it should be added accordingly. –  Jan 15 '18 at 22:35
  • @user733733 You can do it with a right index? Why this is not generic? – Tai Jan 15 '18 at 22:44
1

a semipathetic solution would be the following.

Does anybody have ideas how to make the for loop more efficient / respectively avoid it at all?

import pandas as pd
df_x = pd.DataFrame({"a": [47, 49, 55, 54, 53, 24, 27, 30], "b": [41, 43, 51, 52, 53, 41, 42, 43], "c": [4, 4, 5, 5, 5, 4, 4, 4]})
df_x
Out[11]: 
    a   b  c
0  47  41  4
1  49  43  4
2  55  51  5
3  54  52  5
4  53  53  5
5  24  41  4
6  27  42  4
7  30  43  4
# make new column that allows group by
df_x['cumsum']=(df_x.c != df_x.c.shift()).cumsum()
df_x
Out[14]: 
    a   b  c  cumsum
0  47  41  4       1
1  49  43  4       1
2  55  51  5       2
3  54  52  5       2
4  53  53  5       2
5  24  41  4       3
6  27  42  4       3
7  30  43  4       3
# introduce index spreaded by 10
df_x['index10'] = df_x.index * 10
print(df_x)
    a   b  c  cumsum  index10
0  47  41  4       1        0
1  49  43  4       1       10
2  55  51  5       2       20
3  54  52  5       2       30
4  53  53  5       2       40
5  24  41  4       3       50
6  27  42  4       3       60
7  30  43  4       3       70
groupby = df_x.groupby("cumsum")
# initialize dataframe for new row
df_x_append = pd.DataFrame()
for key, item in groupby:
    # sub dataframe is too small
    if item.shape[0]!=3:
        # new entry will be in the middle of existing values
        my_index = item.index10[0]+5
        # create temporary dataframe
        df_x_single = pd.DataFrame({"index10":[my_index], "a": [(item.a[0]+item.a[1])/2], "b": [(item.b[0]+item.b[1])/2],"c":[item.c[0]]})
        # append this dataframe
        df_x_append = df_x_append.append(df_x_single)
df_x=df_x.append(df_x_append)
# sort by spreaded index
df_x=df_x.sort_values(by='index10', ascending=True, na_position='first')
print(df_x)
      a     b  c  cumsum  index10
0  47.0  41.0  4     1.0        0
0  48.0  42.0  4     NaN        5
1  49.0  43.0  4     1.0       10
2  55.0  51.0  5     2.0       20
3  54.0  52.0  5     2.0       30
4  53.0  53.0  5     2.0       40
5  24.0  41.0  4     3.0       50
6  27.0  42.0  4     3.0       60
7  30.0  43.0  4     3.0       70
# set spreaded index and remove it
df_x=df_x.set_index('index10')
df_x = df_x.reset_index().drop(["index10"], axis=1)
print(df_x)
      a     b  c  cumsum
0  47.0  41.0  4     1.0
1  48.0  42.0  4     NaN
2  49.0  43.0  4     1.0
3  55.0  51.0  5     2.0
4  54.0  52.0  5     2.0
5  53.0  53.0  5     2.0
6  24.0  41.0  4     3.0
7  27.0  42.0  4     3.0
8  30.0  43.0  4     3.0