1

I'm using this dataset to solve a problem ->https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.

I used this code to see repeated values in the column "id": pd.concat(g for _, g in df.groupby("id") if len(g) > 1) and the output is this:

output

what i'm want to do is to compare repeated values, e.g: id 1000102 appear two times, that means it was sold more than once in the range. What I would like to do is take these two entries and compare the sale value in the "price" column to find the biggest variances, but with all the entries. For example sorting from highest to lowest variation.

  • Please show the code as test, not images. Thanks – DialFrost Aug 14 '22 at 05:16
  • Welcome to Stack Overflow! Please, see [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/14627505), and [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) – Vladimir Fokow Aug 15 '22 at 03:35
  • What would you like your output to be? The full dataframe with [values in groups sorted](https://stackoverflow.com/q/27842613/14627505)? Or the maximum values (or [minimum](https://stackoverflow.com/q/51074911/14627505)) of each group ([that has length >1](https://stackoverflow.com/a/17945528/14627505))? Please provide an example output that you're after – Vladimir Fokow Aug 15 '22 at 03:42
  • @IsmaelPauchner, no worries - that is exactly why we are giving you advice and not downvoting - with the hope that you learn :) – Vladimir Fokow Aug 15 '22 at 15:28
  • Sorry for the errors, it's my first post. @VladimirFokow As I would like to know if the houses made a profit or loss being sold more than once in the period, what I would like is to make the id's second appearance - its first appearance. For example: with id 1000102, row 2497 in the price column (second appearance) with a value of 300000.0 - row 2496 (first appearance) with a value of 280000.0. The output would be 20000.0. – Ismael Pauchner Aug 15 '22 at 15:29
  • @DialFrost[link](https://colab.research.google.com/drive/1JdxD81fohYT4n6lLI9m7Z3RZLJmU850M?usp=sharing) here's the code that I used. – Ismael Pauchner Aug 15 '22 at 18:00
  • How do you want [rows with 'id' 17602-17604](https://imgur.com/a/0CrpKsv) be handled? This is the only group with 3 elements in it. Also: generally, do you want to get just this difference for each group? – Vladimir Fokow Aug 15 '22 at 20:49

1 Answers1

1

In groups by 'id': from the LAST element subtract the FIRST one. This will be the diff value that corresponds to the 'id' of the group.

diffs = df_2.groupby('id')['price'].agg(lambda x: x.iloc[-1] - x.iloc[0] )
diffs.name = 'diff'

You can aggregate with any function here! For example: lambda x: x.max() - x.min().

If you'd like to join the diffs Series as a column to the df:

df.iloc[:, :3].merge(diffs, left_on='id', right_index=True)

Other

This question provides a better version of leaving only groups with more than 1 element:

df_2 = df.groupby('id').filter(lambda x: len(x) > 1)
Vladimir Fokow
  • 3,728
  • 2
  • 5
  • 27