3

Given the following pandas dataframe

+----+------------------+-------------------------------------+--------------------------------+
|    |   AgeAt_X        |   AgeAt_Y                           |   AgeAt_Z                      |
|----+------------------+-------------------------------------+--------------------------------+
|  0 |   Older than 100 |                      Older than 100 |                          74.13 |
|  1 |              nan |                                 nan |                          58.46 |
|  2 |              nan |                                 8.4 |                          54.15 |
|  3 |              nan |                                 nan |                          57.04 |
|  4 |              nan |                               57.04 |                            nan |
+----+------------------+-------------------------------------+--------------------------------+

how can I replace values in specific columns which equal Older than 100 with nan

+----+------------------+-------------------------------------+--------------------------------+
|    |   AgeAt_X        |   AgeAt_Y                           |   AgeAt_Z                      |
|----+------------------+-------------------------------------+--------------------------------+
|  0 |              nan |                                 nan |                          74.13 |
|  1 |              nan |                                 nan |                          58.46 |
|  2 |              nan |                                 8.4 |                          54.15 |
|  3 |              nan |                                 nan |                          57.04 |
|  4 |              nan |                               57.04 |                            nan |
+----+------------------+-------------------------------------+--------------------------------+

Notes

  • After removing the Older than 100 string from the desired columns, I convert the columns to numeric in order to perform calculations on said columns.
  • There are other columns in this dataframe (that I have excluded from this example), which will not be converted to numeric, so the conversion to numeric must be done one column at a time.

What I've tried

Attempt 1

if df.isin('Older than 100'):
    df.loc[df['AgeAt_X']] = ''
else:
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])

Attempt 2

if df.loc[df['AgeAt_X']] == 'Older than 100r':
    df.loc[df['AgeAt_X']] = ''
elif df.loc[df['AgeAt_X']] == '':
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])

Attempt 3

df['AgeAt_X'] = ['' if ele == 'Older than 100' else df.loc[df['AgeAt_X']] for ele in df['AgeAt_X']]

Attempts 1, 2 and 3 return the following error:

KeyError: 'None of [0 NaN\n1 NaN\n2 NaN\n3 NaN\n4 NaN\n5 NaN\n6 NaN\n7 NaN\n8 NaN\n9 NaN\n10 NaN\n11 NaN\n12 NaN\n13 NaN\n14 NaN\n15 NaN\n16 NaN\n17 NaN\n18 NaN\n19 NaN\n20 NaN\n21 NaN\n22 NaN\n23 NaN\n24 NaN\n25 NaN\n26 NaN\n27 NaN\n28 NaN\n29 NaN\n ..\n6332 NaN\n6333 NaN\n6334 NaN\n6335 NaN\n6336 NaN\n6337 NaN\n6338 NaN\n6339 NaN\n6340 NaN\n6341 NaN\n6342 NaN\n6343 NaN\n6344 NaN\n6345 NaN\n6346 NaN\n6347 NaN\n6348 NaN\n6349 NaN\n6350 NaN\n6351 NaN\n6352 NaN\n6353 NaN\n6354 NaN\n6355 NaN\n6356 NaN\n6357 NaN\n6358 NaN\n6359 NaN\n6360 NaN\n6361 NaN\nName: AgeAt_X, Length: 6362, dtype: float64] are in the [index]'

Attempt 4

df['AgeAt_X'] = df['AgeAt_X'].replace({'Older than 100': ''})

Attempt 4 returns the following error:

TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'

I've also looked at a few posts. The two below do not actually replace the value but create a new column derived from others

Replace specific values in Pandas DataFrame

Pandas replace DataFrame values

ALollz
  • 57,915
  • 7
  • 66
  • 89
n8-da-gr8
  • 541
  • 6
  • 20
  • 1
    Is there any reason why you'd want to keep *other non-numeric* values in the columns? If there isn't and it just turns out that `'Older than 100'` is the only problematic string, then the way to go is `pd.to_numeric(df['col_name'], errors='coerce')` – ALollz Mar 27 '19 at 21:53
  • @ALollz, good idea. I'll try this and report back tomorrow. Thanks! – n8-da-gr8 Mar 27 '19 at 22:09
  • 1
    @ALollz, I believe your solution is the best by far. From the docs `If ‘coerce’, then invalid parsing will be set as NaN` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html). This is exactly want I need and only requires setting a parameter and no additional code. Thanks! – n8-da-gr8 Mar 28 '19 at 13:40

2 Answers2

3

We can loop through each column and check if the sentence is present. If we get a hit, we replace the sentence with NaN with Series.str.replace and right after convert it to numeric with Series.astype, in this case float:

df.dtypes
AgeAt_X     object
AgeAt_Y     object
AgeAt_Z    float64
dtype: object

sent = 'Older than 100'

for col in df.columns:
    if sent in df[col].values:
        df[col] = df[col].str.replace(sent, 'NaN')
        df[col] = df[col].astype(float)

print(df)
   AgeAt_X  AgeAt_Y  AgeAt_Z
0      NaN      NaN    74.13
1      NaN      NaN    58.46
2      NaN     8.40    54.15
3      NaN      NaN    57.04
4      NaN    57.04      NaN

df.dtypes
AgeAt_X    float64
AgeAt_Y    float64
AgeAt_Z    float64
dtype: object
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Thanks @Erfan, this solution saved me the step of identifying the columns to convert to numeric individually. – n8-da-gr8 Mar 27 '19 at 19:07
  • Nice solution, but the for loop and `.astype(float)` lines are unnecessary, because if the replacement results in a column with all numeric values, pandas automatically casts that column to numeric data type. A shorter solution is `df.replace('Older than 100', np.nan, inplace=True)` – Peter Leimbigler Mar 27 '19 at 20:56
2

If I understand you correctly, you can replace all occurrences of Older than 100 with np.nan with a single call to DataFrame.replace. If all remaining values are numeric, then the replace will implicitly change the data type of the column to numeric:

# Minimal example DataFrame
df = pd.DataFrame({'AgeAt_X': ['Older than 100', np.nan, np.nan],
                   'AgeAt_Y': ['Older than 100', np.nan, 8.4],
                   'AgeAt_Z': [74.13, 58.46, 54.15]})
df
          AgeAt_X         AgeAt_Y  AgeAt_Z
0  Older than 100  Older than 100    74.13
1             NaN             NaN    58.46
2             NaN             8.4    54.15

df.dtypes
AgeAt_X     object
AgeAt_Y     object
AgeAt_Z    float64
dtype: object

# Replace occurrences of 'Older than 100' with np.nan in any column
df.replace('Older than 100', np.nan, inplace=True)

df
   AgeAt_X  AgeAt_Y  AgeAt_Z
0      NaN      NaN    74.13
1      NaN      NaN    58.46
2      NaN      8.4    54.15

df.dtypes
AgeAt_X    float64
AgeAt_Y    float64
AgeAt_Z    float64
dtype: object
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
  • 1
    This works for his example, but this would not be sufficient while converting to numeric, since it wont check of presence `Older than 100` in each column. – Erfan Mar 27 '19 at 18:50
  • Thanks @Peter Leimbigler, your solution did work but Erfan's solution did as well but also mitigated the need for me to identify the columns I wanted to convert to numeric, because if they contained `Older than 100`, I was going to convert them anyways. – n8-da-gr8 Mar 27 '19 at 19:05
  • In fact, there is no need for explicit conversion to numeric data types. If the string replacement results in a column with all numeric values, pandas implicitly converts the data type of that column to a numeric type, in this case `float64`. I've edited my answer to reflect this behaviour. – Peter Leimbigler Mar 27 '19 at 20:53
  • Though, with an object column that contains numeric-like values and strings, `pandas` will typically read in `1` as `'1'`, and in those cases it doesn't get typecast (Data like `7,8;foo,9`) – ALollz Mar 27 '19 at 21:58
  • @Alollz, good point, doesn't look like that's the case with this example data, but may occur in other columns. – Peter Leimbigler Mar 27 '19 at 22:28