0

I want to automate a data validation process using Pandas. This involves two steps:

The first is making sure that each row has a valid value for a column. For example, I want to check that in the column 'Independent' that row has either a 1 or a 0, and that it throws a flag if it does not. So if Independent==1 or Independent==0 IndepFlag=0 else IndepFlag=1 (For that row). Based on this flag, I want to print that Row's ID to an excel file.

The second is doing something similar, but checking multiple columns (e.g. if Independent==1 and Column2>1 Column2Flag=1.

My question is, how do I check for a valid value and then capture and print from the column that has the row's ID, and how do I do this with multiple columns?

lbug
  • 361
  • 1
  • 3
  • 16

1 Answers1

1

You can accomplish this using where: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking

In [22]: df = pd.DataFrame(dict(Independent=[-1,0,1,4,0], Column2=[0,1,0,2,2]))

In [23]: df
Out[23]:
   Column2  Independent
0        0           -1
1        1            0
2        0            1
3        2            4
4        2            0

where converts values that don't meet your criteria to nan. You can use notnull() to get a boolean Series and change it to ints if you like using astype:

In [24]: df["IndepFlag"] = df.Independent.where((df.Independent == 0) | (df.Independent == 1)).notnull().astype(int)

In [25]: df
Out[25]:
   Column2  Independent  IndepFlag
0        0           -1          0
1        1            0          1
2        0            1          1
3        2            4          0
4        2            0          1

Then you can select only the rows that interest you and write them to an excel file:

In [26]: flagged = df[df.IndepFlag == 1]

In [27]: flagged
Out[27]:
   Column2  Independent  IndepFlag
1        1            0          1
2        0            1          1
4        2            0          1

In [28]: flagged.to_excel("flagged.xlsx")

Using multiple columns is not very different, just change where to use (df.Independent == 0) & (df.Column2 > 1).

Brian from QuantRocket
  • 5,268
  • 1
  • 21
  • 18
  • Brian this works great! Just two quick questions: What if I want to append to an existing excel file instead of writing to an entirely new file? Also, what if I just want to write the IndepFlag column, and a column containing the row id into the existing excel file? Thanks! – lbug Feb 20 '15 at 18:23
  • 1
    To only write the index and IndepFlag, use `flagged.IndepFlag.to_excel("flagged.xlsx")`. Not sure how to append to an existing excel file: one approach would be to read the existing file into a dataframe, concat the new dataframe using `pd.concat`, and write to a new excel file. Or maybe this post can help: http://stackoverflow.com/questions/20219254/how-to-write-to-an-existing-excel-file-without-overwriting-data – Brian from QuantRocket Feb 20 '15 at 19:29
  • 1
    Correction to my previous comment: `Series` don't have a `to_excel` method, but this will work instead: `pd.DataFrame(flagged, columns=["IndepFlag"]).to_excel("flagged.xlsx")` – Brian from QuantRocket Feb 20 '15 at 21:45