1

Some background: MB column only consists of 1 of 2 values (M or B) while the Indent column contains int. The numbers don't necessarily follow a pattern but if it does increment, it will increment by one. The numbers can decrement by any amount. The rows are sorted in a specific order.

The goal here is to drop rows with Indent values higher than the indent value of a row that contains a "B" value in the MB column. This should only stop once the indent value is equal to or less than the row that contains the "B" value. Below is a chart demonstrating what rows should be dropped.

Example

Sample data:

import pandas as pd
d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)

Code:

My current code has issues where I cant drop the rows of the inner for loop since it isn't using iterrows. I am aware of dropping based on a conditional expression but I am unsure how to nest this correctly.

for index, row in df.iterrows():
    for row in range(index-1,0,-1):
        if df.loc[row].at["INDENT"] <= df.loc[index].at["INDENT"]-1:
            if df.loc[row].at["MB"]=="B":
                df.drop(df.index[index], inplace=True)
                break
            else:
                break

Edit 1: This problem can be represented graphically. This is effectively scanning a hierarchy for an attribute and deleting anything below it. The example I provided is bad since all rows that need to be dropped are simply indent 3 or higher but this can happen at any indent level. Hierarchy

Edit 2: We are going to cheat on this problem a bit. I won't have to generate an edge graph from scratch since I have the prerequisite data to do this. I have an updated table and sample data. Updated Table

Updated Sample Data

import pandas as pd
d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
    }
df = pd.DataFrame(d)

Updated Code

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
    }
df = pd.DataFrame(d)

G = nx.Graph()
G = nx.from_pandas_edgelist(df, 'a', 'c', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source=-1).reverse()
print([x for x in T])
nx.draw(G, with_labels=True)
plt.show()

I am unsure how to use the edges from here to identify the rows that need to be dropped from the dataframe

onyex
  • 37
  • 6
  • why index 4 is dropped but not 3? and please provide copy pastable sample data, e.g. df.to_dict() – Andreas Sep 07 '21 at 22:23
  • The iteration would be once you get to index 1, you get a B. Going to index 2, there is an indent of 1 which matches the indent value of the "B" on index 1. Thus there isn't anything we would drop as a result of index 1. When We get to index 3, we get to another "B". This time, there is an indent value that is less than the indent at index 3. {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}} – onyex Sep 07 '21 at 23:03
  • thanks for the dictionary, but the B at index 3 is greater than the B at index 1 as well, why isn't that dropped? Because there is an M in between? – Andreas Sep 07 '21 at 23:21
  • The indent at index 2 is less than or equal to the indent at index 1. This "resets" the conditional statement looking for indent values that are greater than the row with an MB of "B". This is why I need break statements in the inner for loop. – onyex Sep 07 '21 at 23:58
  • Thank you for elaborating and adding a diagramm, I still think there is a problem, logic wise. Its unforunately not clear why index 4 has to be removed. Because index 3 increases by one over index 1, so that is fine, but index 4 also increases by 1 over index 3 but this is not fine? – Andreas Sep 08 '21 at 14:45
  • Starting from index 0, we are going to go down the MB column until we hit a "B" value. We should then look further down the chain to see if there are rows in that flow. If there are, they should be deleted from the df. We can only look down the chain by identifying the indent the original item was at. To this end, if the next item has an indent less than or equal to the indent of the line with the "B", we know that the row is not in the same flow as the line with the "B". Repeat ad nauseam. Index 4 is being removed because it is "in line" with index 3, which has a "B" value. – onyex Sep 08 '21 at 17:11

2 Answers2

1

Not a answer, but to long for a comment:

import pandas as pd
d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)

df['i'] = df['INDENT']+1
df = df.reset_index()
df = df.merge(df[['INDENT', 'index', 'MB']].rename(columns={'INDENT':'target', 'index':'ix', 'MB': 'MBt'}), left_on=['i'], right_on=['target'], how='left')

import networkx as nx
G = nx.Graph()
G = nx.from_pandas_edgelist(df, 'index', 'ix', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source='0').reverse()
print([x for x in T])
nx.draw(G, with_labels=True)

This demonstrates the problem. You actually want to apply graph theory, the library networkx can help you with that. A first step would be to first construct the connection between each node, like I did in the example above. From there you can try to apply a logic to filter edges you don't want.

Andreas
  • 8,694
  • 3
  • 14
  • 38
  • This is good information. I need to do a bunch of research on this since it appears that I could generate an edge list with the given data. In the mean time, I lucked out and was provided with an edge list. Even given an edge list, I don't see a way to identify the rows that need to be dropped. – onyex Sep 09 '21 at 19:58
0

Not sure I fully understand the question you're asking however this is my attempt. It selects only if index > indent and mb ==B. In this case I'm just selecting the subset we want instead of dropping the subset we don't

import numpy as np
import pandas as pd
x=np.transpose(np.array([[0,1,2,3,4,5,6,7,8],[0,1,1,2,3,3,4,2,3],['M','B','M','B','B','M','M','B','M']]))
df=pd.DataFrame(x,columns=['Index','indent','MB'])

df1=df[(df['Index']>=df['indent']) & (df['MB']=='B')]

print(df1)
tjaqu787
  • 295
  • 2
  • 16