0

I have a dataframe extracted from excel sheet.

I am looking for NOT legit rows.

A legit row is such that it meets ANY of the following conditions:

  1. exactly 1 column filled in but the other columns are empty or null
  2. exactly 2 columns are filled in but the other columns are empty or null
  3. exactly all 8 columns are filled in

SO a NON legit row is the opposite of the above such as:

  1. 7 of the 8 columns are filled in but one is empty
  2. 6 of the 8 columns are filled in but any of the two is empty and so on...

The 8 columns i am interested in are: columns A, B, D, E, F, G, I, L.

I only want to return those rows that are NOT legit.

I know how to find rows which are empty in specific columns but not sure how to find the non legit rows based on the above conditions.

    empty_A = sheet[sheet[sheet.columns[0]].isnull()]
    empty_B = sheet[sheet[sheet.columns[1]].isnull()]
    empty_D = sheet[sheet[sheet.columns[3]].isnull()]
    empty_E = sheet[sheet[sheet.columns[4]].isnull()]
    empty_F = sheet[sheet[sheet.columns[5]].isnull()]
    empty_G = sheet[sheet[sheet.columns[6]].isnull()]
    empty_I = sheet[sheet[sheet.columns[8]].isnull()]
    empty_L = sheet[sheet[sheet.columns[11]].isnull()]
    print(empty_G)

UPDATE:

I solved using list comprehension

Kim Stacks
  • 10,202
  • 35
  • 151
  • 282
  • 1
    can you post a snippet our your data (as text into the question) – moys Sep 09 '19 at 04:04
  • the data is a bit sensitive, but is there something that will help solve the question – Kim Stacks Sep 09 '19 at 04:06
  • 1
    it does not have to be actual data, you can create a fake to represent your data. IIUC, you want the rows where 5 columns are empty or null? is that correct? and do they have to be specific columns? – moys Sep 09 '19 at 04:08
  • i have added clarification in question. Basically I want rows where any 7 out of 8 cols are non-empty, OR 6/8, OR 5/8, OR 4/8, OR 3/8 – Kim Stacks Sep 09 '19 at 04:11

3 Answers3

0
df.loc[(df.isna().sum(axis=1)==0) | (df.isna().sum(axis=1)==7) | (df.isna().sum(axis=1)==6)]
moys
  • 7,747
  • 2
  • 11
  • 42
  • no. but it's fine. I found list comprehension and it helps. https://stackoverflow.com/a/55557758/80353 – Kim Stacks Sep 09 '19 at 04:47
  • Check the updated answer. I think this is what you need. From what i know 'list comprehension` is slower. You can use this if you like. – moys Sep 09 '19 at 04:57
  • Your answer did not help me legitimise Rows with 1,2 or all 8 columns filled. So no. I’m blunt sorry – Kim Stacks Sep 09 '19 at 05:01
  • No problem. We are all helping each other & learning here :-). It was not clear what was the expected output. I have made one last try. Try to check it out or feel free to ignore. – moys Sep 09 '19 at 05:23
0

If you already have populated your dataframe then you can do it like this

import numpy as np
import pandas as pd

## Generate Random Data
raw_data=np.random.choice([None,1], (50,8))
raw_data= np.r_[raw_data, np.random.choice([None, 1,2,3], (50,8))]

## Create dataframe from random data
df = pd.DataFrame(raw_data, columns="A, B, D, E, F, G, I, L".split(","))
notnull_counts = (~df.isnull()).sum(axis=1)

## filter rows with your condition
legit_rows = df[((notnull_counts==1) | (notnull_counts==2) | (notnull_counts==8))]
non_legit_rows = df[~((notnull_counts==1) | (notnull_counts==2) | (notnull_counts==8))]

display(legit_rows)

Dev Khadka
  • 5,142
  • 4
  • 19
  • 33
0

It seems like you want to count the number of null values in these 8 particular columns and select rows based on how many nulls are found. That phrasing suggests summing and selecting based on that sum. Most pandas operations default to performing columnwise operations, so you need to tell sum() to perform the sum for each row by using axis="columns", like so:

# This is a series indexed like df.
# It counts the number of null values in the given columns.
n_null = df[["A", "B", "D", "E", "F", "G", "I", "L"]].isnull().sum(axis="columns")

# This selects the rows where n_null has certain values.
df_notlegit = df.loc[n_null.isin([8, 5, 4, 3, 2])]

# This is another way to do it.
df_nonlegit = df.loc[(n_null > 1) & (n_null < 9)]
PaSTE
  • 4,050
  • 18
  • 26