2

I have a DataFrame from pandas:

import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}]
df = pd.DataFrame(inp)
print (df)

If a change in a row's string value occurs comparing to previous row, I want to identify it in a separate row "Cng-Address", and if row's numeric value changes identify it in "Cng-Year" column. If there is no change identify it as zero.

The index is “Name” meaning that the above calculations should be done for all rows associated to person name. If a “Name” changes (i.e. John to Steve) then calculations for "Cng-Address" and "Cng-Year" should reset. Column year sorted ascending.

As a final report I want to get:

  • John changed years “1” time and changed locations “2” times
  • Steve changed years “2” times and change locations “2” times
  • Total changed addresses for Year 2019 is “2” times

Current Output:

+-------+------+---------------+
| Name  | Year | Address       |
+-------+------+---------------+
| John  | 2018 | Beverly hills |
+-------+------+---------------+
| John  | 2018 | Beverly hills |
+-------+------+---------------+
| John  | 2019 | Beverly hills |
+-------+------+---------------+
| John  | 2019 | Orange county |
+-------+------+---------------+
| John  | 2019 | New York      |
+-------+------+---------------+
| Steve | 2018 | Canada        |
+-------+------+---------------+
| Steve | 2019 | Canada        |
+-------+------+---------------+
| Steve | 2019 | Canada        |
+-------+------+---------------+
| Steve | 2020 | California    |
+-------+------+---------------+
| Steve | 2020 | Canada        |
+-------+------+---------------+

Ideal Output:

+-------+------+---------------+----------+-------------+
| Name  | Year | Address       | Cng-Year | Cng-Address |
+-------+------+---------------+----------+-------------+
| John  | 2018 | Beverly hills | 0        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2018 | Beverly hills | 0        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | Beverly hills | 1        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | Orange county | 0        | 1           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | New York      | 0        | 1           |
+-------+------+---------------+----------+-------------+
| Steve | 2018 | Canada        | 0        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada        | 1        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada        | 0        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | California    | 1        | 1           |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | Canada        | 0        | 1           |
+-------+------+---------------+----------+-------------+
AHK
  • 85
  • 1
  • 10
  • 1
    Does this answer your question? [Comparing previous row values in Pandas DataFrame](https://stackoverflow.com/questions/41399538/comparing-previous-row-values-in-pandas-dataframe) – AMC Apr 07 '20 at 21:30
  • @AMC I have noticed this before posting. The twist in my question is the grouping and re-indexing before the change identification. – AHK Apr 12 '20 at 19:22

3 Answers3

3

You can use pd.shift to compare the row to the previous row:

df["Cng-Year"] = ((df["Year"] != df["Year"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
df["Cng-Address"] = ((df["Address"] != df["Address"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
#df[['Cng-Year','Cng-Address']]=df[['Cng-Year','Cng-Address']].replace(True,1).replace(False,0) OR
#df[['Cng-Year','Cng-Address']] = np.where(df[['Cng-Year','Cng-Address']], 1,0)
wwnde
  • 26,119
  • 6
  • 18
  • 32
Daniel Geffen
  • 1,777
  • 1
  • 11
  • 16
3

YOu can do with groupby:

groups = df.groupby('Name')

for col in ['Year', 'Address']:
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)

Output:

    Name  Year        Address  cng-Year  cng-Address
0   John  2018  Beverly hills         0            0
1   John  2018  Beverly hills         0            0
2   John  2019  Beverly hills         1            0
3   John  2019  Orange county         0            1
4   John  2019       New York         0            1
5  Steve  2018         Canada         0            0
6  Steve  2019         Canada         1            0
7  Steve  2019         Canada         0            0
8  Steve  2020     California         1            1
9  Steve  2020         Canada         0            1
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • This piece of code worked perfectly for me! Could you comment on how to count the total Address change between "Canada" and "California" for the year "2020"? – AHK Apr 14 '20 at 11:20
  • 1
    `groups[col].shift()` shifts the corresponding column by `1` within each name. `fillna(df[col]` fills the first row in each (shifted) group with the original, indicating no change. Finally, `ne(df[col])` compares the shifted values with the original values for changes. – Quang Hoang Apr 14 '20 at 13:20
  • I have asked a new version of the question including your answer as a partial solution. It would be nice if you could take a look! https://stackoverflow.com/q/61216389/2739921 – AHK Apr 14 '20 at 20:18
1

You can use rolling and check if the value is equal to the one above:

df['Cng-Year'] = df.groupby('Name')['Year'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
df['Cng-Address'] = df.groupby('Name')['Address'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
Bruno Mello
  • 4,448
  • 1
  • 9
  • 39
  • I have tried this but apparently not very efficient for big size data! I have 300 million rows and the code is running for 6 hours and not finished yet. – AHK Apr 09 '20 at 13:45