I need a function that, given a data frame and a number num, constructs a data frame with num rows such that every row holds the following values:

- for columns with string values, a value sampled from that column in the original table
- for columns with floats or ints, the mean value of that column

Here is my code:

def rows_aggr(df, num):
    dataframe = None
    for i in range(0, num):
        row = None
        for cname in df.columns.values:
            column = df[cname]
            dfcol = Series.to_frame(column)

            if column.dtype != np.number:
                item = dfcol.sample(n=1)
            else:
                item = dfcol.mean(axis=1)

            if row is None:
                row = item
            else:
                row = pd.concat([row, item], axis=1)

        if dataframe is None:
            dataframe = row
        else:
            dataframe = pd.concat([dataframe, row], axis=0)

    return dataframe

For some reason the rows contain NaN values and the number of rows exceeds num, so this code does not seem to work correctly. If you know a better way of accomplishing what I need, I would be happy to hear it.

For

df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)}) and num=3

we would get something like

c, 2.5
f, 2.5
b, 2.5

assuming c, f, and b were randomly picked.

Thank you!

YohanRoth

1 Answer

One error seems to be that the condition column.dtype != np.number does not work as a test for numeric columns. There is also a problem with index alignment when you do pd.concat([row, item], axis=1): item carries an index label that is not always the same as the one in row, and the mismatch adds rows with NaN to row. Here is another way to do it.
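To make the points above concrete, here is a minimal sketch. The one-row frames `a` and `b` are picked by hand (rows 0 and 3) to stand in for two independent `sample(n=1)` calls, and `pd.api.types.is_numeric_dtype` is used as one possible dtype test, not the one from the question. The last two lines are an additional observation that the answer does not make: `mean(axis=1)` averages across columns for each row, so it would not give a single column mean even when the numeric branch is reached.

```python
import pandas as pd

df = pd.DataFrame({"col1": list("abcdef"),
                   "col2": list("ijklmn"),
                   "col3": range(6)})

# One possible way to test whether a column is numeric.
is_num = {c: pd.api.types.is_numeric_dtype(df[c]) for c in df.columns}

# Two one-row frames with different index labels (0 and 3), standing in
# for two independent sample(n=1) calls.
a = df[["col1"]].iloc[[0]]
b = df[["col2"]].iloc[[3]]

# concat with axis=1 aligns on index labels, so the labels 0 and 3
# produce two rows, each padded with NaN.
misaligned = pd.concat([a, b], axis=1)

# Dropping the labels first yields the single row that was intended.
aligned = pd.concat([a.reset_index(drop=True),
                     b.reset_index(drop=True)], axis=1)

# mean(axis=1) returns one value per row; mean() on the column returns
# a single scalar.
row_means = df[["col3"]].mean(axis=1)
col_mean = df["col3"].mean()
```

Here `misaligned` has two rows while `aligned` has one, which is consistent with the extra NaN rows reported in the question.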

SETUP

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':list('abcdef'),'col2':list('ijklmn'),
                   'col3':range(6),'col4':np.arange(10,16)/1.5})
print (df)
  col1 col2  col3       col4
0    a    i     0   6.666667
1    b    j     1   7.333333
2    c    k     2   8.000000
3    d    l     3   8.666667
4    e    m     4   9.333333
5    f    n     5  10.000000

You can use select_dtypes to find the columns that are not numeric, and build the dataframe with a dictionary comprehension like:

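def rows_aggr(df, num):
    # Columns whose dtype is not numeric (e.g. strings) get sampled.
    list_col_notnumeric = df.select_dtypes(exclude=[np.number]).columns
    # Sampled columns give arrays of length num; numeric columns give a
    # scalar mean that pandas broadcasts to every row.
    return pd.DataFrame({col: df[col].sample(num).values
                              if col in list_col_notnumeric
                              else df[col].mean()
                         for col in df.columns})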

print (rows_aggr(df, 3))
  col1 col2  col3      col4
0    d    i   2.5  8.333333
1    a    n   2.5  8.333333
2    c    j   2.5  8.333333
Ben.T
  • So when I run it on my dataset I get the error "ValueError: If using all scalar values, you must pass an index" – YohanRoth Nov 16 '18 at 02:54
  • @YohanRoth From what I read about [this error](https://stackoverflow.com/a/52800065/9274732), it seems that adding the parameter `index=range(num)` to the `pd.DataFrame` call may solve the issue. As I can't reproduce the error, it's difficult to be sure that this is the cause. Maybe share the first few rows of your dataset; if it contains any special type of data (other than `str`/`float`/`int`), that may be the cause too. – Ben.T Nov 16 '18 at 04:16
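
The suggestion in the last comment can be sketched as follows. This is one guess at how the error arises, not a confirmed diagnosis of the asker's dataset: if every column is numeric, every value in the dictionary is a scalar mean, and `pd.DataFrame` then needs an explicit index to know how many rows to build. The name `rows_aggr_indexed` and the `numeric_only` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def rows_aggr_indexed(df, num):
    # Hypothetical variant of rows_aggr that always passes an index.
    non_numeric = set(df.select_dtypes(exclude=[np.number]).columns)
    return pd.DataFrame(
        {col: df[col].sample(num).to_numpy() if col in non_numeric
              else df[col].mean()
         for col in df.columns},
        index=range(num),  # lets pandas broadcast scalar means to num rows
    )

# Assumed trigger for the error: a frame with numeric columns only.
numeric_only = pd.DataFrame({"col3": range(6),
                             "col4": np.arange(10, 16) / 1.5})
result = rows_aggr_indexed(numeric_only, 3)

# The same function still works on the mixed frame from the answer.
mixed = pd.DataFrame({"col1": list("abcdef"), "col3": range(6)})
mixed_result = rows_aggr_indexed(mixed, 3)
```

Note that `sample(num)` draws without replacement by default, so this sketch assumes `num` does not exceed the number of rows in `df`; passing `replace=True` to `sample` would lift that limit.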