Python, Pandas to calculate average with replicated rows

Question

I want to duplicate each row as many times as the value in column 'n', and replace the value in column 'v' with the average (v divided by n).

I am following the example in "Replicating rows in a pandas data frame by a column value".

import pandas as pd
import numpy as np

df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [1, 2, 3],
'v' : [ 10, 13, 8]
})
df2 = df.loc[np.repeat(df.index.values,df.n)]

#pd.__version__ 0.20.3
#np.__version__ 1.15.0

But it raises an error:

Traceback (most recent call last):
  File "C:\Python27\Working Scripts\pv.py", line 14, in <module>
    df2 = df.loc[np.repeat(df.index.values, df.n)]
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 61, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 41, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

What is going wrong here, and how can I correct it? Thank you. (Other pandas and numpy scripts work fine on this computer.)

I can't reproduce, it works on my machine. I have pandas 0.23.4, try upgrading it maybe ? — IMCoins, Sep 25 '18 at 08:49
It works for me too. Try `df.reindex(df.index.repeat(df.n))`? — Abhi, Sep 25 '18 at 08:52
@IMCoins, thank you. I upgraded pandas to 0.23.4 and numpy to 1.15.2 but still the same. — Mark K, Sep 25 '18 at 08:56
@Abhi, with upgraded Pandas and Numpy, it's still the same... — Mark K, Sep 25 '18 at 08:57
These are literally shots in the dark for me since I can't reproduce it. Try `df.index.values.astype('int32')`? — IMCoins, Sep 25 '18 at 09:04
@IMCoins, superb!! the line changes to "df2 = df.loc[np.repeat(df.index.values, df.n.astype('int32'))]", it works. Can you help me with the average question? — Mark K, Sep 25 '18 at 09:05

IMCoins · Accepted Answer · 2018-09-25T09:37:48.893

We usually only answer one question per thread, but you probably didn't know. For the first question, it has been answered in the comments. Casting to int32 explicitly solved your problem.
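For reference, here is a minimal sketch of the working row replication with the explicit cast from the comments. The variable name `repeats` is only illustrative. A plausible explanation, not confirmed in this thread, is that NumPy on this Python build expects 32-bit repeat counts while pandas stores 'n' as int64, so the "safe" casting rule rejects the conversion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8],
})

# Cast the repeat counts explicitly; int32 can be safely widened to int64,
# so this should also work on platforms where the original line ran fine.
repeats = df['n'].astype('int32').values

# Repeat each index label n times, then select those labels with .loc.
df2 = df.loc[np.repeat(df.index.values, repeats)]
```

Each row now appears as many times as its 'n' value, and the original index labels are kept (so labels repeat).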

As for the average question, you can reassign the values like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [1, 2, 3],
'v' : [ 10, 13, 8]
})
# Cast the repeat counts to int32, as worked out in the comments
df2 = df.loc[np.repeat(df.index.values, df.n.astype('int32'))]
df2.loc[:, 'v'] = df2['v'] / df2['n']

print(df2)

#   id  n          v
# 0  A  1  10.000000
# 1  B  2   6.500000
# 1  B  2   6.500000
# 2  C  3   2.666667
# 2  C  3   2.666667
# 2  C  3   2.666667

I replaced the line df2['v'] = df2['v'] / df2['n'] with the .loc form, which is the recommended way to assign to data in pandas.

As stated in the comments, the original line raises a warning. As the linked page explains, this warning can be a false positive. As long as you know what you are doing, you should be fine. The warning is there to tell you that df2 may be a copy of a slice of df, so an assignment to it might not change the original DataFrame. Here df2 is meant to be a separate DataFrame, so that is not a problem.

tl;dr from the link: you can disable the warning with:

pd.options.mode.chained_assignment = None # default='warn'
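As an alternative to silencing the warning globally, a common approach is to make the replicated frame an explicit copy before assigning to it. The sketch below assumes the same sample data as the question. It is not the method used above, just another way to get the same result without touching the pandas options.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8],
})

# Replicate the rows, then take an explicit copy so that pandas knows
# df2 is an independent DataFrame and not a view on df.
repeats = df['n'].astype('int32').values
df2 = df.loc[np.repeat(df.index.values, repeats)].copy()

# Plain column assignment replaces the whole column, so the change
# from integer to float values is not a problem.
df2['v'] = df2['v'] / df2['n']
```

Because df2 is a copy, the original df keeps its integer 'v' values, and the division only affects the replicated rows.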

thanks again. great finding with a perfect solution to the question! — Mark K, Sep 25 '18 at 09:20
It also gives a small warning on df2['v'] = df2['v'] / df2['n']: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead". How can I avoid this warning? Thank you. — Mark K, Sep 25 '18 at 09:25
