Removing text and character values from a column in a data frame

Question

I have a "weight" column in my data frame, but the CSV file contains a lot of unwanted text. I need to remove the letters and all other characters except the dot (.) from the column. Example:

import pandas as pd

df  = pd.DataFrame(
    [
        (1, '+9.1A', 100),
        (2, '-1A', 121),
        (3, '5B', 312),
        (4, '+1D', 567),
        (5, '+1C', 123),
        (6, '-2E', 101),
        (7, '+3T', 231),
        (8, '5A', 769),
        (9, '+5B', 907),
        (10, 'text', 15),
    ],
    columns=['colA', 'weight', 'colC']
)
print(df)

The expected result is:

Real Example

df  = pd.DataFrame(
    [
        (0,68),
        (1,67),
        (2,68.1),
        (3,97.1),
        (4,113.9),
        (5,114),
        (6,112),
        (7,111.8),
        (8,111),
        (9,110.8),
        (10,111.2),
        (11,),
        (12,111.5),
        (13,'Not Appropriate at t'),

    ],
    columns=['colA', 'weight']
)
print(df)

Note that I tried .str.replace(r'\D', ''), but it also removes the dot. – Marwa, May 26 '23 at 12:34
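
For reference, the attempt above loses the dot because `\D` matches every non-digit character, the decimal point included. The following is only a minimal sketch of a character class that spares the dot. It assumes a pandas version where `regex=True` has to be passed explicitly, and the small series `s` is made up for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the 'weight' column from the question
s = pd.Series(["+9.1A", "-1A", "5B", "text"])

# \D matches every non-digit, so the decimal point is removed as well
no_dot = s.str.replace(r"\D", "", regex=True)

# [^\d.] matches everything except digits and the dot, so the dot survives
with_dot = s.str.replace(r"[^\d.]", "", regex=True)

print(no_dot.tolist())
print(with_dot.tolist())
```

With this approach, entries that contain only text become empty strings rather than NaN, so they would still need handling before any conversion to float.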

Pablo C · Accepted Answer · 2023-05-26T14:51:46.803

1

You can use pandas.Series.str.extract:

df["weight"] = df["weight"].str.extract(r"(\d+\.?\d*)")

df

#   colA weight  colC
#0     1    9.1   100
#1     2      1   121
#2     3      5   312
#3     4      1   567
#4     5      1   123
#5     6      2   101
#6     7      3   231
#7     8      5   769
#8     9      5   907
#9    10    NaN    15

For the real data example, first convert the column to strings, because in a mixed-type column the .str methods return NaN for every value that is not a string:

df["weight"] = df["weight"].astype("str")

df["weight"] = df["weight"].str.extract(r"(\d+\.?\d*)")

df

#    colA weight
#0      0     68
#1      1     67
#2      2   68.1
#3      3   97.1
#4      4  113.9
#5      5    114
#6      6    112
#7      7  111.8
#8      8    111
#9      9  110.8
#10    10  111.2
#11    11    NaN
#12    12  111.5
#13    13    NaN
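
The extracted values are still strings. If a numeric column is wanted, a sketch along the following lines might help. The helper name `clean_weight` and the shortened sample frame are assumptions made for illustration, not part of the original answer:

```python
import pandas as pd


def clean_weight(column: pd.Series) -> pd.Series:
    """Hypothetical helper: keep the first unsigned number of each entry as a float."""
    # Cast to str first so that numeric entries are matched by the regex too
    as_text = column.astype("str")
    # expand=False returns a Series rather than a one-column DataFrame
    extracted = as_text.str.extract(r"(\d+\.?\d*)", expand=False)
    # Entries without any digits are NaN after extract and stay NaN here
    return pd.to_numeric(extracted, errors="coerce")


# Shortened, made-up sample in the spirit of the real data
df = pd.DataFrame(
    {
        "colA": [0, 1, 2, 3],
        "weight": [68, 68.1, None, "Not Appropriate at t"],
    }
)
df["weight"] = clean_weight(df["weight"])
print(df)
print(df["weight"].dtype)
```

Missing values and text-only entries both end up as NaN, which keeps the column usable for numeric operations such as `mean()`.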

edited May 26 '23 at 14:51

answered May 26 '23 at 12:43


You missed the + and - sign of the numbers. Otherwise a nice solution. – 3dSpatialUser May 26 '23 at 12:46
@3dSpatialUser according to OP's example, they don't have to be extracted – Pablo C May 26 '23 at 12:48
Sorry you are right. I gave you an upvote, sorry for my mistake. – 3dSpatialUser May 26 '23 at 12:49
@PabloC thanks for your answer. Actually, I have real newborn weights for 18,000 patients, with the same kind of entries in the weight column (mixed text and numbers). When I apply this code to the demo example it works, but for the real entries it converts all column entries to NaN. Do you have any idea why this happened? – Marwa May 26 '23 at 14:21
@Marwa can you post a small part of the original df? – Pablo C May 26 '23 at 14:23
@PabloC I updated the question with a real sample of the data – Marwa May 26 '23 at 14:49
@Marwa does it work now? – Pablo C May 26 '23 at 14:51

score 0 · Answer 2 · answered May 26 '23 at 12:46

You can use a regex with apply to remove those parts of your column:

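import re

import numpy as np

def filter_number(x):
    # With the minus sign (this pattern does not keep a leading +):
    # number = re.search(r'(\-?\d+\.?\d*)', str(x))
    # Without the + and - signs:
    # str(x) guards against non-string entries such as floats or NaN
    number = re.search(r'(\d+\.?\d*)', str(x))
    if number:
        return float(number.group(1))
    return np.nan

df.weight = df.weight.apply(filter_number)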
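
If the signs should be kept after all, which the question's expected output does not seem to require, a variant could look roughly like the sketch below. The name `filter_signed_number` and the sample values are hypothetical:

```python
import re

import numpy as np

# Optional leading + or -, then digits with an optional decimal part
SIGNED_NUMBER = re.compile(r"[+-]?\d+\.?\d*")


def filter_signed_number(x):
    """Hypothetical variant that keeps a leading sign when one is present."""
    match = SIGNED_NUMBER.search(str(x))  # str() tolerates non-string entries
    return float(match.group(0)) if match else np.nan


values = ["+9.1A", "-1A", "5B", "text", 68.1]
results = [filter_signed_number(v) for v in values]
print(results)
```

It can be applied to a column in the same way as above, for example with `df.weight.apply(filter_signed_number)`.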
