1

I have the "Weight " column in my data frame but in CSV file, there are many of unwanted text, and I need to remove the letters and all characters except (.) the dot from column example:

import pandas as pd

df  = pd.DataFrame(
    [
        (1, '+9.1A', 100),
        (2, '-1A', 121),
        (3, '5B', 312),
        (4, '+1D', 567),
        (5, '+1C', 123),
        (6, '-2E', 101),
        (7, '+3T', 231),
        (8, '5A', 769),
        (9, '+5B', 907),
        (10, 'text', 15),
    ],
    columns=['colA', 'weight', 'colC']
)
print(df)




the expected result is :

enter image description here

Real Example

df  = pd.DataFrame(
    [
        (0,68),
        (1,67),
        (2,68.1),
        (3,97.1),
        (4,113.9),
        (5,114),
        (6,112),
        (7,111.8),
        (8,111),
        (9,110.8),
        (10,111.2),
        (11,),
        (12,111.5),
        (13,'Not Appropriate at t'),

    ],
    columns=['colA', 'weight']
)
print(df)
Marwa
  • 97
  • 7

2 Answers2

1

You can use pandas.Series.str.extract:

df["weight"] = df["weight"].str.extract("(\d+\.?\d*)")

df

#   colA weight  colC
#0     1    9.1   100
#1     2      1   121
#2     3      5   312
#3     4      1   567
#4     5      1   123
#5     6      2   101
#6     7      3   231
#7     8      5   769
#8     9      5   907
#9    10    NaN    15

For the real data example, before you have to convert the column to a str column:

df["weight"] = df["weight"].astype("str")

df["weight"] = df["weight"].str.extract("(\d+\.?\d*)")

df

#    colA weight
#0      0     68
#1      1     67
#2      2   68.1
#3      3   97.1
#4      4  113.9
#5      5    114
#6      6    112
#7      7  111.8
#8      8    111
#9      9  110.8
#10    10  111.2
#11    11    NaN
#12    12  111.5
#13    13    NaN
Pablo C
  • 4,661
  • 2
  • 8
  • 24
0

You can use regex and apply to remove those parts of you column:

import re

def filter_number(x):
    # With + and - sign
    # number = re.search(r'(\-?\d+\.?\d*)', x)
    # without + and - sign
    number = re.search(r'(\d+\.?\d*)', x)
    if number:
        return float(number.groups()[0])
    return np.nan

df.weight = df.weight.apply(filter_number)
3dSpatialUser
  • 2,034
  • 1
  • 9
  • 18