I have a CSV file with ~2.3M rows. I'd like to save the subset (~1.6M) of rows that have non-NaN values in two particular columns of the dataframe. I'd like to keep using pandas for this. Right now, my code looks like:
import pandas as pd
catalog = pd.read_csv('catalog.txt')
slim_list = []
for i in range(len(catalog)):
    if not pd.isna(catalog['z'][i]) and not pd.isna(catalog['B'][i]):
        slim_list.append(i)
which holds the indices of the rows of catalog that have non-NaN values in both columns. I then make a new catalog with those rows as entries:
slim_catalog = pd.DataFrame(columns=catalog.columns)
for j in range(len(slim_list)):
    # pull each selected row out as a dict and append it to the new frame
    data = catalog.iloc[slim_list[j]].to_dict()
    slim_catalog = slim_catalog.append(data, ignore_index=True)
slim_catalog.to_csv('slim_catalog.csv')
This works in principle, and reading each row into a dict speeds it up a little, but it still takes far too long to run over all 2.3M rows. What is a better way to solve this problem?
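For reference, I suspect the vectorized route is something like dropna with a subset of columns (a minimal sketch below, using the same 'z' and 'B' column names as above), but I'm not sure whether it's the idiomatic or fastest way for a file of this size:

import pandas as pd

catalog = pd.read_csv('catalog.txt')
# keep only rows with non-NaN values in both 'z' and 'B'
slim_catalog = catalog.dropna(subset=['z', 'B'])
slim_catalog.to_csv('slim_catalog.csv', index=False)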