Some background: the MB column contains only one of two values (M or B), while the INDENT column contains integers. The numbers don't necessarily follow a pattern, but when they increase from one row to the next, they increase by exactly one. They can decrease by any amount. The rows are sorted in a specific order.
The goal here is to drop every row that comes after a row with "B" in the MB column and has a higher INDENT value than that "B" row. Dropping should only stop once a row's INDENT value is equal to or less than that of the "B" row. Below is a chart demonstrating what rows should be dropped.
Sample data:
import pandas as pd
d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)
Code:
My current code has a problem: I can't drop the rows found by the inner for loop, since that loop isn't using iterrows. I am aware of dropping rows based on a conditional expression, but I am unsure how to nest this correctly.
for index, row in df.iterrows():
    for row in range(index-1,0,-1):
        if df.loc[row].at["INDENT"] <= df.loc[index].at["INDENT"]-1:
            if df.loc[row].at["MB"]=="B":
                df.drop(df.index[index], inplace=True)
                break
            else:
                break
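For reference, the rule described above can probably be written as a single forward pass that remembers the indent of the most recent "B" row whose subtree is still open. This is only a minimal sketch: the helper name `below_b_mask` is made up for illustration, and it assumes the rows are already in hierarchy order, as in the sample data.

```python
import pandas as pd

d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
     'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)


def below_b_mask(indents, flags):
    """Return one bool per row: True if the row sits below an open "B" row.

    Hypothetical helper, not part of pandas.
    """
    mask = []
    block_level = None  # indent of the "B" row whose subtree is being dropped
    for indent, flag in zip(indents, flags):
        if block_level is not None and indent > block_level:
            # Still deeper than the open "B" row: mark for dropping.
            mask.append(True)
            continue
        # Back at or above the "B" row's level: this row is kept.
        # It opens a new subtree to drop only if it is itself a "B" row.
        block_level = indent if flag == "B" else None
        mask.append(False)
    return mask


drop_mask = below_b_mask(df["INDENT"], df["MB"])
result = df[[not dropped for dropped in drop_mask]]
print(result)  # with the sample data, rows '0', '1', '2', '3' and '7' remain
```

Building a boolean mask first and filtering once avoids calling `df.drop` while iterating over the same dataframe.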
Edit 1:
This problem can be represented graphically: it is effectively scanning a hierarchy for an attribute and deleting everything below it. The example I provided is a poor one, since all rows that need to be dropped happen to have an indent of 3 or higher, but this can happen at any indent level.
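To make the hierarchy view concrete, here is a rough sketch of how a parent for each row could be derived from the indent values alone, using a stack of currently open ancestors. The function name `parent_positions` is hypothetical, and the sketch assumes a row's parent is the closest earlier row with a smaller indent.

```python
def parent_positions(indents):
    """For each row position, return the position of its parent row (or None).

    Hypothetical helper: the parent is taken to be the closest earlier row
    with a strictly smaller indent.
    """
    parents = []
    stack = []  # positions of the currently open ancestors, outermost first
    for pos, indent in enumerate(indents):
        # Close every ancestor that is not strictly shallower than this row.
        while stack and indents[stack[-1]] >= indent:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(pos)
    return parents


indents = [0, 1, 1, 2, 3, 3, 4, 2, 3]  # INDENT column from the sample data
print(parent_positions(indents))
# [None, 0, 0, 2, 3, 3, 5, 2, 7]
```

A row then needs to be dropped exactly when one of its ancestors in this structure has "B" in the MB column.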
Edit 2: We are going to cheat on this problem a bit. I won't have to generate an edge graph from scratch since I have the prerequisite data to do this. I have an updated table and sample data.
Updated Sample Data
import pandas as pd
d = {
'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
}
df = pd.DataFrame(d)
Updated Code
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
d = {
'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
}
df = pd.DataFrame(d)
# Build a directed graph with one edge per row, from parent 'a' to child 'c'.
G = nx.from_pandas_edgelist(df, 'a', 'c', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source=-1).reverse()
print([x for x in T])
nx.draw(G, with_labels=True)
plt.show()
I am unsure how to use the edges from here to identify the rows that need to be dropped from the dataframe.
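One possible direction, offered as a sketch rather than a confirmed solution: if each row is read as an edge from parent 'a' to child 'c', then the rows to drop would be those whose 'c' node is a descendant of the 'c' node of any "B" row. `nx.descendants` returns the set of nodes reachable from a given node in a directed graph. This assumes the 'c' values are unique per row, which holds for the sample data but is an assumption about the real data.

```python
import networkx as nx
import pandas as pd

d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219},
}
df = pd.DataFrame(d)

# One directed edge per row, from parent 'a' to child 'c'.
G = nx.from_pandas_edgelist(df, 'a', 'c', create_using=nx.DiGraph())

# Collect every node that sits below the child node of a "B" row.
to_drop = set()
for node in df.loc[df['MB'] == 'B', 'c']:
    to_drop |= nx.descendants(G, node)

# Keep only the rows whose child node is not below a "B" row.
result = df[~df['c'].isin(to_drop)]
print(result)  # with the sample data, rows '0', '1', '2', '3' and '7' remain
```

The "B" rows themselves are kept, because a node is not counted among its own descendants.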