pandas read_csv and keep only certain rows (python)

Question

I am aware of the skiprows argument, which lets you pass a list with the indices of the rows to skip. However, what I have are the indices of the rows I want to keep.

Say that my csv file looks like this, for millions of rows:

The only indices I would like to load are 2 and 3, so

index_list = [2,3]

The input for the skiprows argument would be [0,1,4]. However, I only have [2,3] available.

I am trying something like:

pd.read_csv(path, skiprows = ~index_list)

but no luck. Any suggestions?

Thanks, I appreciate all the help.

Can you provide the exact code instead of a template? – Sreejith Menon Sep 06 '16 at 00:57
@Sreejith hopefully it's more readable now. – dleal Sep 06 '16 at 01:20

score 18 · Answer 1 · answered Mar 23 '19 at 15:50

You can pass a lambda function as the skiprows argument. For example:

rows_to_keep = [2,3]
pd.read_csv(path, skiprows = lambda x: x not in rows_to_keep)

You can read more about it in the pandas documentation for read_csv.
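One thing to watch with the lambda above: the function is called with file line numbers, and line 0 is normally the header, so skipping it makes the first kept line become the column names. Here is a minimal sketch that keeps the header as well, assuming a small made-up CSV held in memory (`csv_text` and `keep` are names invented for this illustration):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = "a,b\n10,11\n20,21\n30,31\n40,41\n"

rows_to_keep = {2, 3}      # file line numbers, where line 0 is the header
keep = rows_to_keep | {0}  # assumption: the header line should survive too

# skiprows calls the function with each line number;
# returning True means that line is skipped.
df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i not in keep)
print(df)
```

Using a set rather than a list for `keep` makes each membership check cheap, which matters when the function is called once per line of a very large file.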

wcyn

[I did some testing](https://i.imgur.com/ljLEmkt.jpg) and found that for the argument `skiprows`, **passing a list is much faster than passing a lambda function.** Passing a list appears to be O(1), whereas passing a lambda func is O(N). So for very large CSV files, I strongly recommend generating the list of rows to skip from a list of known rows to keep first, like gabra's answer. *(Results as of pandas v1.4.1)* – mimocha Mar 30 '22 at 07:41
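If you want to check the list-versus-lambda difference on your own setup, a rough timing sketch could look like the following. It assumes a generated in-memory CSV (`csv_text`, `n_rows`, and the two helper functions are invented for this illustration), and the numbers will depend on pandas version, file size, and hardware:

```python
import io
import timeit
import pandas as pd

# Hypothetical data: one header line followed by n_rows data lines.
n_rows = 20_000
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(n_rows))

keep = {0, 2, 3}  # header plus file lines 2 and 3
# Lines run from 0 (header) to n_rows, so there are n_rows + 1 in total.
skip_list = [i for i in range(n_rows + 1) if i not in keep]

def with_list():
    return pd.read_csv(io.StringIO(csv_text), skiprows=skip_list)

def with_lambda():
    return pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i not in keep)

# Both variants should load exactly the same rows.
assert with_list().equals(with_lambda())

print("list:  ", timeit.timeit(with_list, number=5))
print("lambda:", timeit.timeit(with_lambda, number=5))
```

Note that the timing of `with_list` does not include building `skip_list`, which itself requires one pass over the line numbers.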

score 13 · Accepted Answer · edited May 23 '17 at 10:29

I think you would need to find the number of lines first, like this.

# count the lines in the file; the with block closes it afterwards
with open(path) as f:
    num_lines = sum(1 for line in f)

Then you would need to build the list of line indices to skip, leaving out those in index_list:

to_exclude = [i for i in range(num_lines) if i not in index_list]

and then load your data:

pd.read_csv(path, skiprows = to_exclude)
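Putting the three steps together, a self-contained sketch might look like this. It writes a small made-up file to a temporary location so it can run on its own, keeps the header line (an assumption, since the indices are file line numbers and line 0 is usually the header), and turns index_list into a set so the membership test stays fast for millions of rows:

```python
import os
import tempfile
import pandas as pd

# Write a small hypothetical file so the sketch can run on its own.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n10,11\n20,21\n30,31\n40,41\n")
    path = f.name

index_list = [2, 3]           # file line numbers to keep
keep = set(index_list) | {0}  # assumption: keep the header line too

# Step 1: count the lines in the file.
with open(path) as fh:
    num_lines = sum(1 for _ in fh)

# Step 2: every line number that is not wanted goes into the skip list.
to_exclude = [i for i in range(num_lines) if i not in keep]

# Step 3: load only the remaining lines.
df = pd.read_csv(path, skiprows=to_exclude)
os.remove(path)  # clean up the temporary file
print(df)
```

With a plain list, `i not in index_list` scans the whole list for every line, so the set conversion is worth it once index_list grows large.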

answered Sep 06 '16 at 02:14

gabra

Thank you gabra, I figured I would have to do something like this. It seems odd that there is skiprows but no option to read only certain rows. – dleal Sep 06 '16 at 02:19
@dleal I agree with you. [This](http://stackoverflow.com/q/13651117/2029132) also relates to your question. – gabra Sep 06 '16 at 02:22

You'd need to put `[i for i in range(num_lines) if i not in index_list]`, right? `num_lines` is not iterable; it is an integer. – Nabla Apr 17 '20 at 09:31

score 1 · Answer 3 · answered Jul 06 '22 at 09:43

Another simple solution could be to call .loc right after read_csv. Something like this:

index_to_keep = [2, 4]
pd.read_csv(path).loc[index_to_keep]

Note: This is a slower approach, because the entire file is first loaded into memory and only then are the chosen rows selected.
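If memory is the concern, one possible variation (not part of the answer above) is to read the file in chunks and filter each chunk, so only one chunk is held at a time. The sketch below assumes a small in-memory CSV and an artificially small chunksize for illustration. Keep in mind that .loc and the chunk index use DataFrame labels, which count data rows from 0 and do not include the header line:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = "a,b\n10,11\n20,21\n30,31\n40,41\n50,51\n"
index_to_keep = [2, 4]  # DataFrame index labels (header not counted)

# Whole-file version, as in the answer.
full = pd.read_csv(io.StringIO(csv_text)).loc[index_to_keep]

# Chunked variant: the default index keeps counting across chunks,
# so the same labels can be matched chunk by chunk.
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    selected = chunk[chunk.index.isin(index_to_keep)]
    if not selected.empty:
        parts.append(selected)
chunked = pd.concat(parts)

print(chunked)
```

This still parses every line of the file, so it trades speed for a smaller memory footprint rather than avoiding the work.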
