0

I want to find all the strings in my DataFrame and replace them with NaN so that I can then drop the affected rows with df.dropna(). For example, given the following data set:

x = np.array([1,2,np.NaN,4,5,6,7,8,9,10])
z = np.array([1,2,np.NaN,4,5,np.NaN,7,8,9,"My Name is Jeff"])
y = np.array(["Hello World",2,3,4,5,6,7,8,9,10])

First, I want to dynamically replace every string with np.nan, so that the data becomes:

x = np.array([1,2,np.NaN,4,5,6,7,8,9,10])
z = np.array([1,2,np.NaN,4,5,np.NaN,7,8,9,np.NaN])
y = np.array([np.NaN,2,3,4,5,6,7,8,9,10])

Then, running df.dropna() (assume that x, y and z are columns of a DataFrame rather than separate variables) should give me:

x = np.array([2,4,5,7,8,9])
z = np.array([2,4,5,7,8,9])
y = np.array([2,4,5,7,8,9])
rafaelc
Zakariah Siyaji
  • The dtypes of the first definitions are `float` and `string`; in the second, all `float`; then `int`. In pandas, columns with strings will be `object`. I think the `nan` columns will still be float, but they may be object. If you are starting with a dataframe, I'd suggest defining/showing that rather than numpy arrays. – hpaulj Jul 16 '19 at 01:07

4 Answers

3

Since you tagged pandas:

pd.to_numeric(x,errors='coerce')
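
Since pd.to_numeric handles one Series at a time, one way to cover a whole DataFrame is to run it per column with DataFrame.apply. The following is a minimal sketch rather than a definitive solution; it assumes the question's x, z and y are columns of a DataFrame, and the names df, numeric and cleaned are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Sample data from the question, held in plain lists so the mixed types survive
df = pd.DataFrame({
    "x": [1, 2, np.nan, 4, 5, 6, 7, 8, 9, 10],
    "z": [1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, "My Name is Jeff"],
    "y": ["Hello World", 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# DataFrame.apply calls pd.to_numeric once per column and forwards errors="coerce",
# so anything that cannot be parsed as a number becomes NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Drop every row that has a NaN in any column
cleaned = numeric.dropna()
print(cleaned)
```

Only the rows where all three columns hold a number remain, which matches the result the question asks for.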
BENY
1

Please find the following:

df = pd.DataFrame([x, y, z])

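def Replace(i):
    # Return the value as a float, or NaN if it cannot be converted
    try:
        return float(i)
    except (ValueError, TypeError):
        return np.nan

# applymap is deprecated since pandas 2.1; DataFrame.map is its replacement
df = df.applymap(func=Replace)
# x, y and z are rows here, so drop the columns that contain NaN
df.dropna(axis=1)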

Shiva
0

I think this works:

df = pd.DataFrame(data={'A': [1, 2, 'str'], 'B': ['name', 2, 2]})
for column in df.columns:
    # Replace string values with NaN and leave everything else unchanged
    df[column] = df[column].apply(lambda x: np.nan if isinstance(x, str) else x)
print(df)
Parijat Bhatt
  • That'd work but would be extremely slow. `pd.to_numeric` is preferred! Also, you could just use `df.applymap` with the same lambda; no need to iterate and assign manually. – rafaelc Jul 16 '19 at 00:37
  • Could you please show me how to apply this in code? The problem I am running into is that pd.to_numeric works on a pandas Series, while I am working with a data frame. – Zakariah Siyaji Jul 16 '19 at 00:41
0

I think the following is the simplest version. The function cleanData takes a DataFrame (named file here) and a list of columns that you may want to ignore. In every other column it replaces each value that cannot be parsed as a number with NaN, and then it drops the rows that contain NaN.

def cleanData(file, ignore=None):
    # file is a DataFrame; ignore lists the columns to leave untouched
    ignore = ignore or []
    for column in file.columns:
        if column not in ignore:
            # Coerce the whole column at once; unparseable values become NaN
            file[column] = pd.to_numeric(file[column], errors='coerce')
    file = file.dropna()
    return file
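
To illustrate why an ignore list is useful, here is a small self-contained sketch of the same idea. The helper name coerce_and_drop and the sample columns name and value are hypothetical and chosen only for this example; unlike the function above, it works on a copy so the caller's DataFrame is left unchanged:

```python
import pandas as pd

def coerce_and_drop(frame, ignore=()):
    """Hypothetical helper: coerce non-ignored columns to numbers, then drop NaN rows."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    for column in out.columns:
        if column not in ignore:
            # Values that cannot be parsed as numbers become NaN
            out[column] = pd.to_numeric(out[column], errors="coerce")
    return out.dropna()

df = pd.DataFrame({
    "name": ["a", "b", "c"],      # a legitimate text column
    "value": [1, "oops", 3],      # a numeric column with one stray string
})

result = coerce_and_drop(df, ignore=["name"])
print(result)
```

Without ignore=["name"], every entry in the text column would be coerced to NaN and dropna() would then remove all rows.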
Zakariah Siyaji