Pandas - Compare single row with all rows in same dataframe

Question

I would like to add a column to an existing dataframe which shows a count value. The count value should compare a value in a given row versus all rows in another column.

In my example I want to find the number of times a value in the entire 'end_date' column is earlier than current 'start_date' column. Adding the count to the dataframe like so:

        start_date   end_date   count
1       2020-09-2   2020-09-3     1
2       2020-09-6   2020-09-7     3
3       2020-09-4   2020-09-5     2
4       2020-09-1   2020-09-1     0

I have tried

df['count'] = (df[df['end_date']<df['start_date']]).count()

but this results in the count column being 0 for all rows as the start_date is always less than the end_date within any one row.

I have updated the example to better represent my actual problem — ckeiderling, Sep 15 '20 at 17:23
I think this [link](https://stackoverflow.com/questions/35459316/compare-each-row-with-all-rows-in-data-frame-and-save-results-in-list-for-each-r) answers your question — Anurag Reddy, Sep 15 '20 at 17:34
@aerijman, this gives the same problem as proposed in the question using count() — ckeiderling, Sep 15 '20 at 18:02
Now I see, then do this within a loop on `itertuples()` or just iterating over the `index`. Ping me if you want me to post the answer. — aerijman, Sep 15 '20 at 18:29

Pranav Hosangadi · Answer 1 · 2020-09-15T18:05:55.560

0

You want count[i] = number of times compare[:] is less than ref[i].
You did count[:] = number of times compare[i] < ref[i]

A straightforward way is to iterate over the rows and calculate individually.

for i, row in df.iterrows():
    df.at[i, 'count'] = (df['end_date'] < row['start_date']).sum()

(df['end_date'] < row[i, 'start_date']) returns a column of True or False depending on whether the condition is satisfied. .sum() takes True values as 1 and False values as 0.

edited Sep 15 '20 at 18:05

answered Sep 15 '20 at 17:27

Pranav Hosangadi

23,755
7
44
70

This would presumably only work with a integer based index. – ckeiderling Sep 15 '20 at 17:57
@ckeiderling The important part of the code is that you iterate over the rows, compare the iteration row with all other rows, and assign the result to the iteration row. If your indexing is different, it's trivial to get the "active ingredient" of the code to fit your requirements. – Pranav Hosangadi Sep 15 '20 at 18:00
@ckeiderling, I've edited my answer to use `iterrows()` instead. – Pranav Hosangadi Sep 15 '20 at 18:06

Bruno Carballo · Answer 2 · 2020-09-15T20:13:55.087

0

You can try with an outer join

counts = (
    pd.merge(
        df[["start_date"]].assign(temp=1),
        df[["end_date"]].assign(temp=1),
        on="temp",
        how="outer",
    )
    .query("start_date>end_date")
    .groupby("start_date")
    .temp.count()
)

df = df.merge(counts, on="start_date", how="left").fillna(0, downcast="infer")

edited Sep 15 '20 at 20:13

answered Sep 15 '20 at 17:36

Bruno Carballo

1,156
8
15

This works best so far, it's still quite a long solution to what seems like a simple problem. – ckeiderling Sep 15 '20 at 18:01
I've made it shorter, not sure how to reduce it further – Bruno Carballo Sep 15 '20 at 20:15

score 0 · Answer 3 · answered Sep 15 '20 at 18:01

0

import pandas as pd
my_dict = {'start_date': ['2020-09-02', '2020-09-06', '2020-09-04', '2020-09-01']}
df = pd.DataFrame.from_dict(my_dict)
df['count'] = 0
for index, row in df.iterrows():
    df.at[index,'count'] = df[df['start_date'] < row['start_date']].count()[1]
print (df)

answered Sep 15 '20 at 18:01

Fabricio Fonseca

139
5

While code-only answers might answer the question, you could significantly improve the quality of your answer by providing context for your code, a reason for why this code works, and some references to documentation for further reading. From [answer]: _"Brevity is acceptable, but fuller explanations are better."_ – Pranav Hosangadi Sep 15 '20 at 21:23

Pandas - Compare single row with all rows in same dataframe

3 Answers3