I am calling df.apply(myfunction, args=(df2,x,y,z), axis=1)
The result of myfunction() is a dataframe, but for this to work with df.apply() the returned object must be a pd.Series.
The dataframe returned by myfunction() has various columns and 6 rows of data for each column.
I convert the returned dataframe to a dict and then to a Series so that it works with df.apply().
The output of the pd.Series(df.to_dict()):
book_sale_date {0: 2016-03-01 00:00:00, 1: 2016-03-01 00:00:0...
countx {0: 17, 1: 31, 2: 92, 3: 12, 4: 92, 5: 92}
dbNUM {0: 93353485.0, 1: 93353485.0, 2: 93353485.0, ...
...
When I convert this structure back to dataframe like so:
pd.DataFrame(df.apply(myfunction, args=(df2,x,y,z), axis=1))
The result has the correct columns and the correct data types in each column, but only 1 row, with all the values for each column lumped into that row.
For example, the book_sale_date column looks like:
{0: 2016-03-01 00:00:00, 1: 2016-03-01 00:00:0, 2: 2016-03-01 00:00:0, 3: 2016-03-01 00:00:0, 4: 2016-03-01 00:00:0, 5: 2016-03-01 00:00:0}
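A minimal sketch that reproduces this lumping with toy data. The frames `inner` and `outer` and their values are hypothetical stand-ins (3 rows instead of 6, two columns instead of about 45), not the real data; the last step shows that the dicts in one lumped row can be expanded back into the original rows.

```python
import pandas as pd

# Hypothetical stand-in for the frame that myfunction() builds.
inner = pd.DataFrame({
    "countx": [17, 31, 92],
    "onnfl_max": [117.69, 220.14, 13708.76],
})

# Hypothetical stand-in for the outer frame that apply() iterates over.
outer = pd.DataFrame({"city1": ["A", "A"], "city2": ["B", "C"]})

# to_dict() gives {column: {row_label: value}}, so the Series has one entry
# per column and each entry is a dict rather than a column of values.
result = outer.apply(lambda row: pd.Series(inner.to_dict()), axis=1)

# One output row per input row; every cell holds a whole dict.
print(result.shape)
print(type(result.iloc[0, 0]))

# Expanding the dicts of a single lumped row recovers the original rows.
restored = pd.DataFrame(result.iloc[0].to_dict())
print(restored)
```

Expanding one row at a time like this works on the toy data, but it would have to be repeated for every row of the outer frame.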
Here is the output of intermediate_df.to_clipboard(), which is the df that I want to construct but am forced to turn into a dict and then into a Series to work with .apply().
sale_month countx onnfl_cumsum_minmax_c_sum_ratio onnfl_max onnfl_min onnfl_sprd onnfl_sprd_median onnfl_sprd_neg_count onnfl_sprd_neg_sum onnfl_sprd_pos_count onnfl_sprd_pos_neg_sum_ratio onnfl_sprd_pos_sum
0 2016-03-01 17 1.54829687344 117.69 -37.31 100.11 0.588235294118 -54.89 0.176470588235 2.82382947714 155.0
1 2016-03-01 31 1.28473432668 220.14 -8.35 177.85 0.354838709677 -72.39 0.290322580645 3.45683105401 250.24
2 2016-03-01 92 1.21749735751 -860.93 0.478260869565 -1185.49 0.195652173913 0.273777087955 324.56
3 2016-03-01 12 13708.76 -937.27 17069.77 292.365 0.25 -1970.44 0.75 9.66292300197 19040.21
4 2016-03-01 92 1.00115588305 13708.76 389.47 15511.95 1413.72 0.282608695652 -376.35 0.413043478261 42.21681945 15888.3
5 2016-03-01 92 1.03090199741 98.32 -4765.51 -5139.15 -471.96 0.489130434783 -5945.64 0.20652173913 0.135643934042 806.49
Update:
I am experiencing some variant of link
The other question I have: is df.apply() even the right approach if the desired result is a dataframe?
Here is what I am trying to do:
1) I have a dataframe df of 2 columns that has 1 million rows.
2) The 2 columns are names of cities - city1 and city2. Each row is a combination of cities from a large universe of cities.
3) I have another dataframe called df2 that has daily hourly temperature data for 4000 cities.
4) I want to iterate through each row of df and do a lookup in df2 to extract temperature data for each of the 2 cities and compute various statistics, e.g. temp spread during specific hours, sums, averages etc.
5) The result object is a dataframe that has 6 rows and about 45 columns of statistics for each city pair
If I run myfunction() on a single row of df by itself, passing in the same arguments as those passed to df.apply(), then it works. My question is: should I run myfunction() in a for loop over each row of df, or use df.apply()? Which is faster for a 1 million row df?
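For reference, a minimal sketch of the loop-based alternative, assuming each call returns a small dataframe. The names `pairs`, `temps` and `pair_stats` and all values are hypothetical stand-ins for df, df2 and myfunction(). The per-pair frames are collected in a dict and stacked once with pd.concat, so no dict-to-Series round trip is needed.

```python
import pandas as pd

# Hypothetical toy data standing in for the real inputs.
pairs = pd.DataFrame({"city1": ["A", "A"], "city2": ["B", "C"]})
temps = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "hour": [0, 1, 0, 1, 0, 1],
    "temp": [10.0, 12.0, 7.0, 8.0, 15.0, 11.0],
})


def pair_stats(city1, city2, temps):
    """Hypothetical stand-in for myfunction(): one small frame per city pair."""
    t1 = temps.loc[temps["city"] == city1].set_index("hour")["temp"]
    t2 = temps.loc[temps["city"] == city2].set_index("hour")["temp"]
    spread = t1 - t2  # subtraction aligns the two Series on hour
    return pd.DataFrame({"hour": spread.index, "sprd": spread.to_numpy()})


# Build one frame per pair, keyed by the pair, then stack them once at the end.
frames = {
    (row.city1, row.city2): pair_stats(row.city1, row.city2, temps)
    for row in pairs.itertuples(index=False)
}

# The tuple keys become the outer index levels, so each block of rows
# stays labelled with the city pair it belongs to.
result = pd.concat(frames, names=["city1", "city2"])
print(result)
```

This only shows how the per-pair frames can be combined into one dataframe; it does not by itself say whether the loop or df.apply() is faster at 1 million rows.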