How to identify string change in a row comparing to previous row in a dataframe?

Question

I have a DataFrame from pandas:

import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}]
df = pd.DataFrame(inp)
print (df)

If a change in a row's string value occurs comparing to previous row, I want to identify it in a separate row "Cng-Address", and if row's numeric value changes identify it in "Cng-Year" column. If there is no change identify it as zero.

The index is “Name” meaning that the above calculations should be done for all rows associated to person name. If a “Name” changes (i.e. John to Steve) then calculations for "Cng-Address" and "Cng-Year" should reset. Column year sorted ascending.

As a final report I want to get:

John changed years “1” time and changed locations “2” times
Steve changed years “2” times and change locations “2” times
Total changed addresses for Year 2019 is “2” times

Current Output:

+-------+------+---------------+
| Name  | Year | Address       |
+-------+------+---------------+
| John  | 2018 | Beverly hills |
+-------+------+---------------+
| John  | 2018 | Beverly hills |
+-------+------+---------------+
| John  | 2019 | Beverly hills |
+-------+------+---------------+
| John  | 2019 | Orange county |
+-------+------+---------------+
| John  | 2019 | New York      |
+-------+------+---------------+
| Steve | 2018 | Canada        |
+-------+------+---------------+
| Steve | 2019 | Canada        |
+-------+------+---------------+
| Steve | 2019 | Canada        |
+-------+------+---------------+
| Steve | 2020 | California    |
+-------+------+---------------+
| Steve | 2020 | Canada        |
+-------+------+---------------+

Ideal Output:

+-------+------+---------------+----------+-------------+
| Name  | Year | Address       | Cng-Year | Cng-Address |
+-------+------+---------------+----------+-------------+
| John  | 2018 | Beverly hills | 0        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2018 | Beverly hills | 0        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | Beverly hills | 1        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | Orange county | 0        | 1           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | New York      | 0        | 1           |
+-------+------+---------------+----------+-------------+
| Steve | 2018 | Canada        | 0        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada        | 1        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada        | 0        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | California    | 1        | 1           |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | Canada        | 0        | 1           |
+-------+------+---------------+----------+-------------+

Does this answer your question? [Comparing previous row values in Pandas DataFrame](https://stackoverflow.com/questions/41399538/comparing-previous-row-values-in-pandas-dataframe) — AMC, Apr 07 '20 at 21:30
@AMC I have noticed this before posting. The twist in my question is the grouping and re-indexing before the change identification. — AHK, Apr 12 '20 at 19:22

score 3 · Answer 1 · edited Apr 07 '20 at 21:31

You can use pd.shift to compare the row to the previous row:

df["Cng-Year"] = ((df["Year"] != df["Year"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
df["Cng-Address"] = ((df["Address"] != df["Address"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
#df[['Cng-Year','Cng-Address']]=df[['Cng-Year','Cng-Address']].replace(True,1).replace(False,0) OR
#df[['Cng-Year','Cng-Address']] = np.where(df[['Cng-Year','Cng-Address']], 1,0)

score 3 · Accepted Answer · answered Apr 07 '20 at 21:02

3

YOu can do with groupby:

groups = df.groupby('Name')

for col in ['Year', 'Address']:
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)

Output:

    Name  Year        Address  cng-Year  cng-Address
0   John  2018  Beverly hills         0            0
1   John  2018  Beverly hills         0            0
2   John  2019  Beverly hills         1            0
3   John  2019  Orange county         0            1
4   John  2019       New York         0            1
5  Steve  2018         Canada         0            0
6  Steve  2019         Canada         1            0
7  Steve  2019         Canada         0            0
8  Steve  2020     California         1            1
9  Steve  2020         Canada         0            1

answered Apr 07 '20 at 21:02

Quang Hoang

146,074
10
56
74

This piece of code worked perfectly for me! Could you comment on how to count the total Address change between "Canada" and "California" for the year "2020"? – AHK Apr 14 '20 at 11:20
1

`groups[col].shift()` shifts the corresponding column by `1` within each name. `fillna(df[col]` fills the first row in each (shifted) group with the original, indicating no change. Finally, `ne(df[col])` compares the shifted values with the original values for changes. – Quang Hoang Apr 14 '20 at 13:20
I have asked a new version of the question including your answer as a partial solution. It would be nice if you could take a look! https://stackoverflow.com/q/61216389/2739921 – AHK Apr 14 '20 at 20:18

score 1 · Answer 3 · answered Apr 07 '20 at 21:00

1

You can use rolling and check if the value is equal to the one above:

df['Cng-Year'] = df.groupby('Name')['Year'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
df['Cng-Address'] = df.groupby('Name')['Address'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))

answered Apr 07 '20 at 21:00

Bruno Mello

4,448
1
9
39

I have tried this but apparently not very efficient for big size data! I have 300 million rows and the code is running for 6 hours and not finished yet. – AHK Apr 09 '20 at 13:45

How to identify string change in a row comparing to previous row in a dataframe?

3 Answers3