How can I get below - Input - Input
and Output - Output
I tried to use shift but I am not getting the result
df1['Value_shift_down'] = df1['Value'].shift(-1)
df1['Value_shift_up'] = df1['Value'].shift(1)
Determinging the lowest and higest value based on a string is not possible (because "10" is smaller then "2"). Therefore you need to replace your strings by integers.
So one solution might be to remove the unit names and convert the unit name numbers to integers. Then you can determine the min and max values by a pivot table.
df['Unit'] = df['Unit'].str.replace('Unit ', '', regex=False).astype(int)
df.pivot_table(index='Value', values='Unit', aggfunc=['min', 'max'])
If neccessary, you can convert afterwards the values back to string and add the unit name.
You were on the right track with pandas .shift()
. Instead of converting back and forth between strings and integers, you can leverage shift
to compare each value with the previous, then use a cumulative sum (.cumsum()
) to detect when the value changes. The final element is the DataFrame .groupby()
method, which allows us to collect groups of same-Value
rows together and get the start and end Unit
s for that group. The code to do this goes something like this:
result_groups = []
for _, group in df1.groupby((df1["Value"] != df1["Value"].shift()).cumsum()):
start_unit = group.loc[group.index.min(), "Unit"]
end_unit = group.loc[group.index.max(), "Unit"]
value = group["Value"].unique()[0]
result_groups.append({"Start_Unit": start_unit, "End_Unit": end_unit, "Value": value})
results = pd.DataFrame(result_groups)
The results
DataFrame looks like this, as desired:
Start_Unit | End_Unit | Value | |
---|---|---|---|
0 | Unit 1 | Unit 1 | 2 |
1 | Unit 2 | Unit 2 | 4 |
2 | Unit 3 | Unit 3 | 1 |
3 | Unit 4 | Unit 5 | 5 |
4 | Unit 6 | Unit 6 | 7 |
5 | Unit 7 | Unit 7 | 6 |
6 | Unit 8 | Unit 8 | 9 |
7 | Unit 9 | Unit 20 | 0 |
Note: This solution does not require uniqueness of the values in the 'Value'
column on a group-by-group basis (i.e., if there was another sequence of 5
s later, it would correctly recognize it as a new group). If you know ahead of time that this will never happen, you can take a simpler approach, using just df1.groupby("Value")
instead and doing something similar inside the loop.