2

For the given dataframe as follows:

  id|address|sell_price|market_price|status|start_date|end_date
  1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
  1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
  2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
  2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
  2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
  3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
  3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
  3|5 Pawnee Avenue|Nan|4000000|forward|2019/10/9|2019/10/10
  3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
  4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
  4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
  4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
  5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
  5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
  5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9

#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|') 

I would like to groupby id and address and calculate mean_ratio and result_count based on the following conditions:

  1. mean_ratio: which is groupby id and address and calculate mean for the rows meet the following conditions: status is finished and start_date isin the range of 2019-09 and 2019-10
  2. result_count: which is groupby id and address and count the rows meet the following conditions: status is either finished or failed, and start_date isin the range of 2019-09 and 2019-10

The desired output will like this:

   id               address  mean_ratio  result_count
0   1    7552 Atlantic Lane         NaN             0
1   2     888 Foster Street        1.32             1
2   3       5 Pawnee Avenue        1.25             1
3   4  916 W. Mill Pond St.        1.44             3
4   5        68 Henry Drive         NaN             2

I have tried so far:

# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)

In order to filter start_date isin the range of 2019-09 and 2019-10:

L = [pd.Period('2019-09'), pd.Period('2019-10')] 
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]

To filter row status is finished or failed, I use:

mask = df['status'].str.contains('finished|failed')
df[mask]

But I don't know how to use those to get final result. Thanks your help at advance.

shaik moeed
  • 5,300
  • 1
  • 18
  • 54
ah bon
  • 9,293
  • 12
  • 65
  • 148
  • Sorry, I created dataframe with excel then use `pd.read_clipboard()`, I don't how to convert it to code. – ah bon Mar 12 '20 at 05:30
  • I have edited the data, not sure if it's OK for you. – ah bon Mar 12 '20 at 05:39
  • Check [how-to-provide-a-reproducible-copy-of-the-dataframe-with-to-clipboard](https://stackoverflow.com/a/52413247/8353711) or add output of `df.to_clipboard(sep=',', index=False)` – shaik moeed Mar 12 '20 at 05:41
  • I think there is a mistak in your data. The last row for **id** = `2` is missing a column value. – DOOM Mar 12 '20 at 05:42
  • Not sure if I understand you correctly, there are some rows `sell_price ` are `NaN`. – ah bon Mar 12 '20 at 05:45
  • It throw an error `TypeError: parser_f() got an unexpected keyword argument 'index'` with `index = False`. – ah bon Mar 12 '20 at 05:51
  • Can you try `df[['id', 'address', 'sell_price', 'market_price', 'status', 'start_date', 'end_date']].to_clipboard(sep=';', index=False)` – DOOM Mar 12 '20 at 05:58
  • Another issue: `ModuleNotFoundError: No module named 'pandas.io.formats.csvs'`. – ah bon Mar 12 '20 at 06:02
  • what is your pandas version. If **!= 1** Please update if you can. and try again – DOOM Mar 12 '20 at 06:11
  • I have checked again, I find no errors, with excel I filter rows `status` is `finished` and `start_date` isin the range of `2019-09` and `2019-10`, then groupby `id` and calculate means of `ratio` for each `id`. – ah bon Mar 12 '20 at 06:28
  • `start_date` for this `(id,address) = (1,7552 Atlantic Lane)` is `2019/8/2` which is not in the specified range. Why it is considered in `result_count`? – shaik moeed Mar 12 '20 at 06:39
  • it wasn't counted, `mean_ratio` is `NaN` and `result_count` is `0`, if you want drop it in final result, it will be OK. – ah bon Mar 12 '20 at 06:43
  • Btw, I update pandas and `df[['id', 'address', 'sell_price', 'market_price', 'status', 'start_date', 'end_date']].to_clipboard(sep=';', index=False)` works, but `df.to_clipboard(sep=',', index=False)` still give error previously mentioned. What's the next step for `to_clipboard`? @DOOM – ah bon Mar 12 '20 at 06:45

2 Answers2

2

I think you need GroupBy.agg, but because some rows are excluded like id=1, then add them by DataFrame.join with all unique pairs id and address in df2, last replace missing values in result_count columns:

df2 = df[['id','address']].drop_duplicates()
print (df2)
    id               address
0    1    7552 Atlantic Lane
2    2     888 Foster Street
5    3       5 Pawnee Avenue
9    4  916 W. Mill Pond St.
12   5        68 Henry Drive

df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')] 
c = ['start_date']

mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])

df = df[mask1 & mask]

df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
                                        result_count=('ratio','size'))

df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
    id               address  mean_ratio  result_count
0    1    7552 Atlantic Lane         NaN           0.0
2    2     888 Foster Street    1.320000           1.0
5    3       5 Pawnee Avenue    1.250000           1.0
9    4  916 W. Mill Pond St.    1.436667           3.0
12   5        68 Henry Drive         NaN           2.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Very awsome, bravo. – ah bon Mar 12 '20 at 06:48
  • Btw, how do you deal with data in my question with index? – ah bon Mar 12 '20 at 06:50
  • @ahbon - Do you think `0,2,5,9,12` ? – jezrael Mar 12 '20 at 06:52
  • @ahbon - If need default only change `df2 = df[['id','address']].drop_duplicates()` to `df2 = df[['id','address']].drop_duplicates().reset_index(drop=True)` – jezrael Mar 12 '20 at 06:53
  • No, i wonder if you have problem for reading the data I pasted from `read_clipboard`, I try to use `pd.read_clipboard(sep=',', index=False)` but it raise an error. – ah bon Mar 12 '20 at 06:56
  • @ahbon - hmmm, I use notepad, there was some data cleaning - all columns in one row (data copied from last column after `start_date`) – jezrael Mar 12 '20 at 06:58
1

Some helpers

def mean_ratio(idf):
    # filtering data
    idf = idf[
              (idf['start_date'].between('2019-09-01', '2019-10-31')) & 
              (idf['mean_ratio'].notnull()) ]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
              (idf['status'].isin(['finished', 'failed'])) & 
              (idf['start_date'].between('2019-09-01', '2019-10-31')) ]
    return idf.shape[0]


# We can caluclate `mean_ratio` before hand
df['mean_ratio'] = df['sell_price'] / df['market_price']

df = df.astype({'start_date': np.datetime64, 'end_date': np.datetime64})

# Group the df
g =  df.groupby(['id', 'address'])

mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')

# Final result
pd.concat((mean_ratio, result_count), axis=1)
ah bon
  • 9,293
  • 12
  • 65
  • 148
DOOM
  • 1,170
  • 6
  • 20
  • Thank u, In your second function, why do you use `df['start_date'].between('2019-09-01', '2019-10-31')` instead of `idf['start_date'].between('2019-09-01', '2019-10-31')`? Is it a typo error? – ah bon Mar 12 '20 at 09:01
  • @ahbon That is a **typo** :). I have updated the ans – DOOM Mar 12 '20 at 09:03
  • I have tested both of your methods, I get the same results. Thanks a lot. – ah bon Mar 12 '20 at 10:15