
I have CSV files which I read from a Windows folder:

import glob
import pandas as pd

files = glob.glob(r"LBT210*.csv")
dfs = [pd.read_csv(f, sep=";", engine='c') for f in files]
df2 = pd.concat(dfs, ignore_index=True)

However the output looks like:

columnA  columnB  columnC
1        1        0
2        0        A
NaN      NaN      1
3        B        D
...

How can I skip reading the rows that contain a NaN (missing value) in columnB, so that I can save some memory and speed up processing? I don't want to read those rows at all! I want to adjust:

dfs = [pd.read_csv(f, sep=";", engine='c') for f in files] somehow

  • Is it OK to drop those after you have read them, or do you want to never read them from the CSV file? – lane Jan 27 '22 at 14:09
  • I want to drop them before I read them. – PV8 Jan 27 '22 at 14:10
  • Does this answer your question? [How can I filter lines on load in Pandas read_csv function?](https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function) – radrow Jan 27 '22 at 14:19
  • That question is from 10 years ago, so it is worth looking into. – lane Jan 27 '22 at 14:26

1 Answer


According to the selected answer of the question linked in the comments above, there isn't a way to filter out rows before the file is read into memory. Since that answer is over 10 years old, I also rechecked the current read_csv options, and it doesn't look like anything else would help here.

Inspired by that question and its selected answer, you can do something like this to reduce memory consumption:

# read the file in chunks of 1,000 rows and keep only rows where columnB is not NaN
iter_csv = pd.read_csv(f, sep=";", engine='c', iterator=True, chunksize=1000)
df = pd.concat([chunk[~chunk['columnB'].isna()] for chunk in iter_csv])
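
To combine this with the glob loop from your question, a minimal sketch could look like the following (assuming the same LBT210*.csv pattern, the ";" separator, and that columnB is the real column name; the read_filtered helper is just an illustrative name):

import glob
import pandas as pd

def read_filtered(path, chunksize=1000):
    # stream the file in chunks and keep only the rows where columnB has a value
    chunks = pd.read_csv(path, sep=";", engine="c", chunksize=chunksize)
    return pd.concat([chunk[chunk["columnB"].notna()] for chunk in chunks], ignore_index=True)

files = glob.glob(r"LBT210*.csv")
df2 = pd.concat([read_filtered(f) for f in files], ignore_index=True)

This still reads every row from disk (which, per the linked answer, can't be avoided with read_csv alone), but only one chunk at a time is held in memory before the NaN rows are discarded.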