I have a CSV file with ~2.3M rows. I'd like to save the subset (~1.6M) of rows that have non-NaN values in two particular columns of the dataframe. I'd like to keep using pandas for this. Right now, my code looks like:
import pandas as pd
catalog = pd.read_csv('catalog.txt')
slim_list = []
for i in range(len(catalog)):
    if not pd.isna(catalog['z'][i]) and not pd.isna(catalog['B'][i]):
        slim_list.append(i)
which holds the indices of the rows of catalog that have non-NaN values in both columns. I then make a new catalog with those rows as entries:
slim_catalog = pd.DataFrame(columns=catalog.columns)
for j in range(len(slim_list)):
    # pull each selected row out as a dict and append it to the new frame
    data = catalog.iloc[slim_list[j]].to_dict()
    slim_catalog = slim_catalog.append(data, ignore_index=True)
slim_catalog.to_csv('slim_catalog.csv')
This works in principle, and reading each row into a dict speeds it up a little, but it still takes far too long to run over all 2.3M rows. What is a better way to solve this problem?
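For reference, I suspect the vectorized route is something like dropna with a subset of columns (a minimal sketch below, using the same 'z' and 'B' column names as above), but I'm not sure whether it's the idiomatic or fastest way for a file of this size:

import pandas as pd

catalog = pd.read_csv('catalog.txt')
# keep only rows with non-NaN values in both 'z' and 'B'
slim_catalog = catalog.dropna(subset=['z', 'B'])
slim_catalog.to_csv('slim_catalog.csv', index=False)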