
So, I read CSV files that are generated using Excel. Those can contain empty columns to the right of, and empty rows below, the data range/table. Empty here means really empty: no column header, no data whatsoever, clearly an artifact.

In a first iteration I just used

pd.read_csv().dropna(axis=1, how='all', inplace=False).dropna(axis='index', how='all', inplace=False) 

which seemed to work fine. But it also removes legitimately empty columns, i.e. regular columns with a column name that are supposed to be empty because that is their data.
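The pitfall can be reproduced with a tiny made-up CSV (inlined via StringIO just for this sketch):

```python
import io

import pandas as pd

# Hypothetical data: 'named_empty' has a header but no values; the trailing
# comma in every line yields a truly empty artifact column on the right.
csv_text = "id,named_empty,\n1,,\n2,,\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())    # ['id', 'named_empty', 'Unnamed: 2']

# dropna(how='all') drops BOTH all-NaN columns, the artifact AND the named one:
cleaned = df.dropna(axis=1, how='all')
print(cleaned.columns.tolist())    # ['id']
```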

I do want to keep all columns that have a proper column name OR contain data: someone might have just forgotten to give a column a name, but it is still a proper column.

So, per https://stackoverflow.com/a/43983654/2215053 I first used

unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
basedata_df2 = pd.concat(
    [basedata_df2.loc[:, ~unnamed_cols_mask],
     basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all')],
    axis=1)

which looks and feels clean, but it scrambles the column order.

So now I go with:

df = pd.read_csv().dropna(axis='index', how='all', inplace=False)
df = df[[column_name for column_name in df.columns
         if not column_name.startswith('Unnamed: ')
         or not df[column_name].isnull().all()]]

Which works. But there should be an obviously right way to accomplish this frequently occurring task, so how could I do this better?

Specifically: Is there a way to make sure the column names starting with 'Unnamed: ' were created by pd.read_csv() and were not originally present in the CSV file?
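One way to answer that last question, sketched here with an invented helper name and made-up data: read the raw header line with the csv module before handing the text to pandas. An empty field in the raw header is then provably a read_csv artifact, while a column literally named 'Unnamed: 3' in the file survives.

```python
import csv
import io

import pandas as pd


def read_csv_drop_artifacts(buf):
    """Hypothetical helper: drop only columns that are unnamed in the raw
    CSV header AND completely empty; keep everything else, including a
    column literally called 'Unnamed: 3' in the file."""
    raw = buf.read()
    header = next(csv.reader(io.StringIO(raw)))  # header fields as written
    df = pd.read_csv(io.StringIO(raw)).dropna(axis='index', how='all')
    keep = [
        col for i, col in enumerate(df.columns)
        if (i < len(header) and header[i] != '') or df[col].notna().any()
    ]
    return df[keep]


# Made-up example: the second column is unnamed but holds data, the last
# one is a pure artifact (no header, no data).
out = read_csv_drop_artifacts(io.StringIO("id,,named_empty,\n1,7,,\n2,,,\n"))
print(out.columns.tolist())    # ['id', 'Unnamed: 1', 'named_empty']
```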

jottbe
POe
  • I guess the line `df = [[column_name for column_name in df.columns.array if not column_name.startswith('Unnamed: ') or not df[column_name].isnull().all()]]` should actually read `df = df[[`, right? But that doesn't really drop the columns; it just creates a slice object, which is something like a view on the existing dataframe and behaves a bit differently (not all operations are supported by slices). – jottbe Jan 26 '21 at 11:42

1 Answer


Unfortunately, I think there is no built-in function for this, not even in pandas.read_csv. But you can apply the following code:

# per-column flag: True where the column contains only NAs
ser_all_na = df.isna().all(axis='index')
# per-column flag: True where the column got a generic name 'Unnamed: ...'
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
# now delete the columns that have no explicit name AND contain only NAs
del_indexer &= ser_all_na
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
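A quick made-up example of that indexer at work; the named-but-empty column survives while the generic one is dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking a read_csv result.
df = pd.DataFrame({
    'id': [1, 2],
    'named_empty': [np.nan, np.nan],   # named column, legitimately empty
    'Unnamed: 2': [np.nan, np.nan],    # artifact produced by read_csv
})

ser_all_na = df.isna().all(axis='index')            # per-column all-NA flag
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
del_indexer &= ser_all_na                           # unnamed AND all-NA
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
print(df.columns.tolist())    # ['id', 'named_empty']
```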
jottbe