1

I would like to add a column to an existing dataframe which shows a count value. The count value should compare a value in a given row versus all rows in another column.

In my example I want to find the number of times a value in the entire 'end_date' column is earlier than current 'start_date' column. Adding the count to the dataframe like so:

        start_date   end_date   count
1       2020-09-2   2020-09-3     1
2       2020-09-6   2020-09-7     3
3       2020-09-4   2020-09-5     2
4       2020-09-1   2020-09-1     0

I have tried

df['count'] = (df[df['end_date']<df['start_date']]).count() 

but this results in the count column being 0 for all rows as the start_date is always less than the end_date within any one row.

ckeiderling
  • 11
  • 1
  • 3

3 Answers3

0

You want count[i] = number of times compare[:] is less than ref[i].
You did count[:] = number of times compare[i] < ref[i]

A straightforward way is to iterate over the rows and calculate individually.

for i, row in df.iterrows():
    df.at[i, 'count'] = (df['end_date'] < row['start_date']).sum()

(df['end_date'] < row[i, 'start_date']) returns a column of True or False depending on whether the condition is satisfied. .sum() takes True values as 1 and False values as 0.

Pranav Hosangadi
  • 23,755
  • 7
  • 44
  • 70
  • This would presumably only work with a integer based index. – ckeiderling Sep 15 '20 at 17:57
  • @ckeiderling The important part of the code is that you iterate over the rows, compare the iteration row with all other rows, and assign the result to the iteration row. If your indexing is different, it's trivial to get the "active ingredient" of the code to fit your requirements. – Pranav Hosangadi Sep 15 '20 at 18:00
  • @ckeiderling, I've edited my answer to use `iterrows()` instead. – Pranav Hosangadi Sep 15 '20 at 18:06
0

You can try with an outer join

counts = (
    pd.merge(
        df[["start_date"]].assign(temp=1),
        df[["end_date"]].assign(temp=1),
        on="temp",
        how="outer",
    )
    .query("start_date>end_date")
    .groupby("start_date")
    .temp.count()
)

df = df.merge(counts, on="start_date", how="left").fillna(0, downcast="infer")

Bruno Carballo
  • 1,156
  • 8
  • 15
0
import pandas as pd
my_dict = {'start_date': ['2020-09-02', '2020-09-06', '2020-09-04', '2020-09-01']}
df = pd.DataFrame.from_dict(my_dict)
df['count'] = 0
for index, row in df.iterrows():
    df.at[index,'count'] = df[df['start_date'] < row['start_date']].count()[1]
print (df)
  • While code-only answers might answer the question, you could significantly improve the quality of your answer by providing context for your code, a reason for why this code works, and some references to documentation for further reading. From [answer]: _"Brevity is acceptable, but fuller explanations are better."_ – Pranav Hosangadi Sep 15 '20 at 21:23