I am calling df.apply(myfunction, args=(df2,x,y,z), axis=1).

The result of myfunction() is a DataFrame, but for this to work with df.apply() the returned object must be a pd.Series.

The dataframe returned by myfunction() has various columns and 6 rows of data for each column.

I convert the returned DataFrame to a dict and then to a Series so that it works with df.apply().

The output of pd.Series(intermediate_df.to_dict()) is:

    book_sale_date                      {0: 2016-03-01 00:00:00, 1: 2016-03-01 00:00:0...
    countx                                    {0: 17, 1: 31, 2: 92, 3: 12, 4: 92, 5: 92}
    dbNUM                             {0: 93353485.0, 1: 93353485.0, 2: 93353485.0, ...
    ...

When I convert this structure back to a DataFrame like so:

    pd.DataFrame(df.apply(myfunction, args=(df2,x,y,z), axis=1))

The result has the correct columns but only 1 row. The values have the correct data types, but all six values of each column are lumped together into a single cell of that row.

For example, the book_sale_date column looks like:

    {0: 2016-03-01 00:00:00, 1: 2016-03-01 00:00:00, 2: 2016-03-01 00:00:00, 3: 2016-03-01 00:00:00, 4: 2016-03-01 00:00:00, 5: 2016-03-01 00:00:00}
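A small reproduction of this behaviour, as a sketch only: the temperature data is made up, and myfunction() here is a simplified, hypothetical stand-in that takes one extra argument and returns two columns instead of the real 45 or so.

```python
import pandas as pd

# Made-up data for illustration: city pairs, and one temperature column per city.
df = pd.DataFrame({"city1": ["A", "B"], "city2": ["B", "C"]})
df2 = pd.DataFrame({"A": [10.0, 12.0, 14.0],
                    "B": [11.0, 15.0, 13.0],
                    "C": [9.0, 8.0, 7.0]})

def myfunction(row, temps):
    # Hypothetical stand-in: build a small per-pair statistics frame.
    sprd = temps[row["city1"]] - temps[row["city2"]]
    intermediate_df = pd.DataFrame({"sprd": sprd, "sprd_cumsum": sprd.cumsum()})
    # The dict-then-Series conversion described above.
    return pd.Series(intermediate_df.to_dict())

lumped = df.apply(myfunction, args=(df2,), axis=1)
print(lumped.shape)                  # one row per input row, one column per statistic
print(type(lumped.loc[0, "sprd"]))   # each cell holds a whole column as a dict
```

Each cell of lumped is a dict keyed by the row labels of the intermediate frame, which is the "lumped in one row" shape described above.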

Here is the output of intermediate_df.to_clipboard(). This is the DataFrame that I actually want to construct, but I am forced to turn it into a dict and then into a Series for it to work with .apply().

    sale_month  countx  onnfl_cumsum_minmax_c_sum_ratio onnfl_max   onnfl_min   onnfl_sprd  onnfl_sprd_median   onnfl_sprd_neg_count    onnfl_sprd_neg_sum  onnfl_sprd_pos_count    onnfl_sprd_pos_neg_sum_ratio    onnfl_sprd_pos_sum
0   2016-03-01  17  1.54829687344   117.69  -37.31  100.11      0.588235294118  -54.89  0.176470588235  2.82382947714   155.0
1   2016-03-01  31  1.28473432668   220.14  -8.35   177.85      0.354838709677  -72.39  0.290322580645  3.45683105401   250.24
2   2016-03-01  92  1.21749735751           -860.93     0.478260869565  -1185.49    0.195652173913  0.273777087955  324.56
3   2016-03-01  12      13708.76    -937.27 17069.77    292.365 0.25    -1970.44    0.75    9.66292300197   19040.21
4   2016-03-01  92  1.00115588305   13708.76    389.47  15511.95    1413.72 0.282608695652  -376.35 0.413043478261  42.21681945 15888.3
5   2016-03-01  92  1.03090199741   98.32   -4765.51    -5139.15    -471.96 0.489130434783  -5945.64    0.20652173913   0.135643934042  806.49

Update:

I am experiencing some variant of link

The other question I have: is using df.apply() even the right approach if the desired result is a DataFrame?

Here is what I am trying to do:

1) I have a dataframe df of 2 columns that has 1 million rows.

2) The 2 columns are names of cities - city1 and city2. Each row is a combination of cities from a large universe of cities.

3) I have another dataframe called df2 that has daily hourly temperature data for 4000 cities.

4) I want to iterate through each row of df and do a lookup in df2 to extract temperature data for each of the 2 cities and compute various statistics, e.g. temperature spread during specific hours, sums, averages, etc.

5) The result object is a dataframe that has 6 rows and about 45 columns of statistics for each city pair.

If I run myfunction() on a single row of df by itself, passing in the same arguments as those passed to df.apply(), it works. My question is: should I run myfunction() in a for loop over each row of df, or use df.apply()? Which is faster for a df with 1 million rows?
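For concreteness, the loop alternative I have in mind looks roughly like the sketch below. The data is made up, and myfunction() is a simplified, hypothetical stand-in (one extra argument, two statistics columns, three rows per pair) rather than the real function. It says nothing about which approach is faster.

```python
import pandas as pd

# Made-up data for illustration: city pairs, and one temperature column per city.
df = pd.DataFrame({"city1": ["A", "B"], "city2": ["B", "C"]})
df2 = pd.DataFrame({"A": [10.0, 12.0, 14.0],
                    "B": [11.0, 15.0, 13.0],
                    "C": [9.0, 8.0, 7.0]})

def myfunction(row, temps):
    # Hypothetical stand-in: returns a DataFrame directly, no Series conversion.
    sprd = temps[row.city1] - temps[row.city2]
    return pd.DataFrame({"sprd": sprd, "sprd_cumsum": sprd.cumsum()})

# Call the function once per row and collect the resulting frames.
frames = [myfunction(row, df2) for row in df.itertuples(index=False)]

# Stack the frames vertically; the outer index level records the source row.
result = pd.concat(frames, keys=list(df.index), names=["pair", "row"])
print(result)
```

With the keys argument, result.loc[i] recovers the statistics frame for the i-th city pair.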

codingknob

1 Answer

The way I got what I wanted to work was by doing the following:

1) Change myfunction() to return pd.Series(intermediate_df.unstack())

That is, unstack the desired dataframe before turning it into a Series object.

2) Change my call to df.apply() to:

    df.apply(myfunction, args=(df2,x,y,z), axis=1).stack().reset_index(drop=True)
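A minimal runnable sketch of this round trip, assuming made-up temperature data and a simplified, hypothetical myfunction() that takes only df2 and produces two statistics columns with three rows per pair:

```python
import pandas as pd

# Made-up data for illustration: city pairs, and one temperature column per city.
df = pd.DataFrame({"city1": ["A", "B"], "city2": ["B", "C"]})
df2 = pd.DataFrame({"A": [10.0, 12.0, 14.0],
                    "B": [11.0, 15.0, 13.0],
                    "C": [9.0, 8.0, 7.0]})

def myfunction(row, temps):
    # Hypothetical stand-in for the real statistics function.
    sprd = temps[row["city1"]] - temps[row["city2"]]
    intermediate_df = pd.DataFrame({"sprd": sprd, "sprd_cumsum": sprd.cumsum()})
    # unstack() flattens the frame into a Series indexed by (column, row label).
    return pd.Series(intermediate_df.unstack())

# One row per city pair; the columns are a (column, row label) MultiIndex.
wide = df.apply(myfunction, args=(df2,), axis=1)

# stack() moves the row-label level back into the index, restoring one
# column per statistic; reset_index(drop=True) then discards the index.
result = wide.stack().reset_index(drop=True)
print(result)
```

Note that reset_index(drop=True) throws away the information about which city pair each block of rows came from. Dropping that call keeps a (pair, row) MultiIndex instead.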

I followed the directions given in the following link on how to create a Series object from a DataFrame.

Perhaps the pandas documentation could add examples that describe how to do things like this.

codingknob