The unprocessed data looks like this:

from io import StringIO
import pandas as pd

data = "i,a,b\ngood,1,2\nbad,3,a"
df = pd.read_csv(StringIO(data))

      i  a  b
0  good  1  2
1   bad  3  a

Rows are skipped correctly with the default index:

pd.read_csv(StringIO(data), skiprows=lambda index: 2 == index)

    i    a  b
0   good 1  2

But when I set my own index with index_col, this does not seem to work (the row is not skipped):

pd.read_csv(StringIO(data), index_col='i', skiprows=lambda index: 'bad' == index)
DurgaDatta

1 Answer


This does not work because skiprows in pandas omits rows by position, not by index label:

data = "i,a,b\ngood,1,2\nbad,3,a\nbad,a,b\ngood,1,2\nbad,3,a"

df = pd.read_csv(StringIO(data))
print (df)
      i  a  b
0  good  1  2
1   bad  3  a
2   bad  a  b
3  good  1  2
4   bad  3  a

df = pd.read_csv(StringIO(data),skiprows=lambda index: 2 == index)
print (df)
      i  a  b
0  good  1  2
1   bad  a  b
2  good  1  2
3   bad  3  a

df = pd.read_csv(StringIO(data),index_col='i', skiprows=lambda index: 2 == index)
print (df)
      a  b
i         
good  1  2
bad   a  b
good  1  2
bad   3  a

A shorter way to do the same thing:

df = pd.read_csv(StringIO(data),skiprows=[2])
print (df)
      i  a  b
0  good  1  2
1   bad  a  b
2  good  1  2
3   bad  3  a

But if you want to remove rows by index label:

df = pd.read_csv(StringIO(data),index_col='i', skiprows=['bad'])
print (df)

TypeError: an integer is required

With a callable, no error is raised, but no rows are skipped either:

df = pd.read_csv(StringIO(data),index_col='i', skiprows=lambda index: 'bad' == index)
print (df)
      a  b
i         
good  1  2
bad   3  a
bad   a  b
good  1  2
bad   3  a


df = pd.read_csv(StringIO(data), skiprows=lambda index: 'bad' == index)
print (df)

      i  a  b
0  good  1  2
1   bad  3  a
2   bad  a  b
3  good  1  2
4   bad  3  a
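
One way to see why the comparison with 'bad' never matches is to record what the callable actually receives. This is only a small sketch: `seen` and `record` are hypothetical names introduced here, and the callable returns False so that nothing is skipped.

```python
from io import StringIO
import pandas as pd

data = "i,a,b\ngood,1,2\nbad,3,a\nbad,a,b\ngood,1,2\nbad,3,a"

seen = []  # every argument pandas passes to the callable

def record(index):
    # Store the argument, then tell pandas not to skip this line.
    seen.append(index)
    return False

df = pd.read_csv(StringIO(data), index_col='i', skiprows=record)

# Inspect the distinct values and their types.
print(sorted(set(seen)))
print({type(v).__name__ for v in seen})
```

The recorded values are integer line positions even though index_col='i' is set, so a test such as 'bad' == index is always False.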

Verifying the sample solution from the pandas documentation:

df = pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
print (df)
      i  a  b
0   bad  3  a
1  good  1  2

df = pd.read_csv(StringIO(data), index_col='i',skiprows=lambda x: x % 2 != 0)
print (df)
      a  b
i         
bad   3  a
good  1  2

EDIT: A possible solution is to preprocess the data to find the positions to skip:

df = pd.read_csv('a.csv')
print (df)
      i  a  b
0  good  1  2
1   bad  3  a
2   bad  a  b
3  good  1  2
4   bad  3  a

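# preprocessing: collect the positions of lines whose first field matches
import csv

def get_row(data):
    out = []
    with open('a.csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        # i counts file lines from 0, so the header is line 0
        for i, row in enumerate(reader):
            if row[0] == data:
                out.append(i)
    return out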


skip = get_row('bad')            
print(skip)
[2, 3, 5]

df = pd.read_csv('a.csv', skiprows=get_row('bad') )
print (df)
      i  a  b
0  good  1  2
1  good  1  2
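
The same idea can be sketched without a file on disk, assuming the CSV text is already in memory as in the earlier examples. The helper name `positions_to_skip` is hypothetical; it only mirrors get_row above and additionally leaves the header line alone.

```python
import csv
from io import StringIO
import pandas as pd

data = "i,a,b\ngood,1,2\nbad,3,a\nbad,a,b\ngood,1,2\nbad,3,a"

def positions_to_skip(text, label):
    # Return the line positions (header is line 0) whose first field equals label.
    reader = csv.reader(StringIO(text))
    return [i for i, row in enumerate(reader)
            if i > 0 and row and row[0] == label]

skip = positions_to_skip(data, 'bad')
print(skip)

# skiprows gets integer positions, so it can be combined with index_col.
df = pd.read_csv(StringIO(data), index_col='i', skiprows=skip)
print(df)
```

Because the offending lines are never parsed, the remaining columns can be inferred as numeric, at the cost of reading the text twice.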
jezrael
  • The doc says "If callable, the callable function will be evaluated against the row indices" - should it not work on the newly set index? Is what you are saying documented? – DurgaDatta May 05 '20 at 05:01
  • @DurgaDatta - yes, I was surprised at first too, but I tested it and added the result to the end of the answer. It seems it always tests by position, not by label name. – jezrael May 05 '20 at 05:01
  • So, to skip rows by arbitrary function, we have to do it once the data is loaded? This also implies that I can apply dtype (bad columns will fail them ), and also be careful in converter functions. – DurgaDatta May 05 '20 at 05:03
  • @DurgaDatta - What I remember from some past pandas versions is that it is not possible to filter by some value in `read_csv`. I think the first [sentence](https://stackoverflow.com/a/13653490/2901002) is unfortunately still true in the latest pandas version. – jezrael May 05 '20 at 05:06
  • @DurgaDatta - Added possible solution. – jezrael May 05 '20 at 06:04
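
As a rough illustration of the alternative discussed in the comments, the rows can also be filtered by label once the data is loaded. This is a sketch under the assumption that the whole file fits in memory; `kept` is a hypothetical name, and the final astype(int) call only works here because the remaining values are all numeric strings.

```python
from io import StringIO
import pandas as pd

data = "i,a,b\ngood,1,2\nbad,3,a\nbad,a,b\ngood,1,2\nbad,3,a"

# Load everything first; the mixed columns are read as text.
df = pd.read_csv(StringIO(data), index_col='i')

# Filter by index label after loading, then convert the dtypes.
kept = df[df.index != 'bad']
kept = kept.astype(int)
print(kept)
```

The dtype conversion has to happen after the filter, since applying it while the 'bad' rows are still present would fail on values such as 'a'.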