
I need to clean up a dataframe whose columns come from different sources and have different types. This means that I can have, for example, string columns that contain "nan", "none", or "NULL" (as a string instead of an actual None value).

My goal is to find all empty values and replace them with None. This works fine:

for column in df.columns:
    for idx, row in df.iterrows():
        if (str(row[column]).lower() == "none") or (str(row[column]).lower() == "nan") or (str(row[column]).lower() == "null"):
            df.at[row.name, column] = None

But it is obviously not the best or fastest way to do it. How can I take advantage of Pandas operations or list comprehensions to do this substitution? Thanks!

Irina

4 Answers


This seems to be a somewhat controversial topic (see e.g. this thread) but it's often said that list comprehensions are more computationally efficient than for loops, especially when iterating over pandas dataframes.

I also prefer using list comprehensions stylistically as it leads to fewer levels of indentation from nested loops/if statements.

Here's what it looks like for your use case:

for column in df.columns:
    vals_list = df[column].to_list()
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x for x in vals_list]
    df[column] = replaced
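
Not from the original answer: for illustration, running the loop above on a small test frame (column names assumed) shows the placeholder strings being replaced:

import pandas as pd

# toy frame mixing real values with placeholder strings (assumed for illustration)
df = pd.DataFrame({'a': ['NaN', 1, 'foo'], 'b': ['NULL', 'None', 2]})

for column in df.columns:
    vals_list = df[column].to_list()
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x for x in vals_list]
    df[column] = replaced

print(df)
# output (approximately):
#       a     b
# 0  None  None
# 1     1  None
# 2   foo     2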
ljdyer
  • Thanks! Could you explain where my_series comes from? – Irina Nov 18 '22 at 15:18
  • Apologies, that should read `df[column]`. I have edited the answer and also added `str` around x in the list comprehension. – ljdyer Nov 19 '22 at 16:07

Simple approach: use isin and mask:

import pandas as pd

df = pd.DataFrame([[1, 2, 'nan'],
                   ['none', 3, 'NULL']])

df_clean = df.mask(df.isin(["nan", "none", "NULL"]))

Or, if you want to update in place:

df[df.isin(["nan", "none", "NULL"])] = float('nan')

Output:

     0  1    2
0    1  2  NaN
1  NaN  3  NaN
mozway
  • seems like converting to a `str` and checking if the first character is an `n` would be more *efficient*, or just check `isinstance(blah, int)`, not sure about the `isinstance` overhead. – Jonny Henly Nov 18 '22 at 05:35
  • @JonnyHenly no, it would very likely be much less efficient. `isin` is quite fast. Also, this would potentially yield many false positives. – mozway Nov 18 '22 at 05:37
  • Is it implemented in C? It's hard to fathom that iterating over a length 3 array would be more efficient than comparing 2 characters. – Jonny Henly Nov 18 '22 at 05:39
  • Well, that's my bet (also `isin` is not iterating, it's based on hashes making search very fast). If you provide me with a counter-example I'll believe you. For now, you're the one making the speculation ;) – mozway Nov 18 '22 at 05:40
  • :) true, I don't really have a leg to stand on. – Jonny Henly Nov 18 '22 at 05:42
  • @Jonny `df = pd.DataFrame(np.random.choice([1,'none'], size=1_000_000).reshape(1000,1000))` then compare `df.isin(["nan", "none", "NULL"])` and `df.apply(lambda s: s.astype(str).str.startswith('n'))` with `timeit`. Checking the string is >10 times slower (and it has many flaws) ;) – mozway Nov 18 '22 at 05:49
  • And that's even without lowercasing the string. Game, set and match. I concede – Jonny Henly Nov 18 '22 at 05:57
  • Thanks for all the answers! Please notice that converting to a string and checking if the first character is "n" is not an option, as I have many valid string fields that start with an n. The idea is to convert to string, and, if the lowercase content is either "none", "nan", or "null", replace it with None in the most efficient way. – Irina Nov 18 '22 at 15:13
  • @Irina if you have specific strings, I believe my proposed method is both the most straightforward and most efficient. No need to convert to string. Case is important however, let me know if this is an issue. – mozway Nov 18 '22 at 15:28
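
Following up on the case-sensitivity point in the last comment, one possible case-insensitive variant of this answer (a sketch, not from the original) builds the mask from a lowercased string view of the frame:

import pandas as pd

df = pd.DataFrame([[1, 2, 'NaN'],
                   ['NONE', 3, 'Null']])

# lowercase a string copy purely for the membership test;
# the original values in df are left untouched
lowered = df.astype(str).apply(lambda s: s.str.lower())

df_clean = df.mask(lowered.isin(['nan', 'none', 'null']))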

A quick and easy optimization:

for column in df.columns:
    for idx, row in df.iterrows():
        col = str(row[column]).lower()
        if (col == "none") or (col == "nan") or (col == "null"):
            df.at[row.name, column] = None

No need to convert row[column] to a str and lowercase it three separate times.

Shorter code:

its_none = ['none', 'nan', 'null']
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower() in its_none:
            df.at[row.name, column] = None
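
A further micro-optimization, not in the original answer: since its_none is only used for membership tests, it could be a set, where the average-case lookup is O(1) rather than a linear scan of a list:

its_none = {'none', 'nan', 'null'}  # set literal: 'in' hashes instead of scanning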

Even shorter (I imagine you're expecting a number) and more optimized:

for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower().startswith('n'):
            df.at[row.name, column] = None
Jonny Henly

If you want to use numpy, you could do this as well (if the values in the fields are truly strings):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name' : ['one', 'two', 'one', 'two'],
    'A' : ['null', 'none', 'empty', 'Keep']
})

df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), '', df['A'])
df
ArchAngelPwn
  • Thanks for the reply! The columns may or may not be strings; that is where the need to convert to str before checking comes from. – Irina Nov 18 '22 at 15:15
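
Building on that comment, one possible adaptation of this answer for mixed-type columns (a sketch, with sample data assumed) compares against a lowercased string view of the column:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name' : ['one', 'two', 'one', 'two'],
    'A' : ['null', 3, 'NaN', 'Keep']   # mixed types, assumed for illustration
})

# str-ify and lowercase only for the comparison, then replace matches with None
mask = df['A'].astype(str).str.lower().isin(['null', 'none', 'nan'])
df['A'] = np.where(mask, None, df['A'])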