4

I am trying to extract only numbers and only strings in two different dataframes. I am using regular expression to extract numbers and string.

import pandas as pd

df_num = pd.DataFrame({
        'Colors': ['lila1.5', 'rosa2.5', 'gelb3.5', 'grün4', 'rot5', 'schwarz6', 'grau7', 'weiß8', 'braun9', 'hellblau10'],
        'Animals': ['hu11nd', '12welpe', '13katze', 's14chlange', 'vo15gel', '16papagei', 'ku17h', '18ziege', '19pferd',
                    'esel20']
    })

for column in df_num.columns:
    df_num[column] = df_num[column].str.extract('(\d+)').astype(float)

print(df_num)

I have also tried using '([\d+][\d+\.\d+])' and '([\d+\.\d+])'

Here I am getting output but not what I am expecting. Though I am expecting float numbers I am not getting 1.5 or 2.5.

I am getting something like below image:

enter image description here

df_str = pd.DataFrame({
        'Colors': ['lila1.5', 'rosa2.5', 'gelb3', 'grün4', 'rot5', 'schwarz6', 'grau7', 'weiß8', 'braun9', 'hellblau10'],
        'Animals': ['hu11nd', '12welpe', '13katze', 's14chlange', 'vo15gel', '16papagei', 'ku17h', '18ziege', '19pferd',
                    'esel20']
    })

for column in df_str.columns:
    df_str[column] = df_str[column].str.extract('([a-zA-Z]+)')

print(df_str)

In this case when the number is at the end or in the beginning then I am getting the string but if the number placed in the middle or any other place then the result which I expect I am not getting. Current output is like below image:

enter image description here

I think my regular expression is not correct. Which will be the right regular expression to solve these problems? Or is there any other way to extract only numbers and only strings in pandas dataframe?

ah bon
  • 9,293
  • 12
  • 65
  • 148
BC Smith
  • 727
  • 1
  • 7
  • 19
  • Does this answer your question? [How to extract a floating number from a string](https://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string) – Rakesh Feb 19 '20 at 08:20

4 Answers4

7

Your code is on the right track, you just need to account for the decimals and the possibility of integers :

df_num['colors_num'] = df_num.Colors.str.extract(r'(\d+[.\d]*)')
df_num['animals_num'] = df_num.Animals.str.extract(r'(\d+[.\d]*)')
df_num['colors_str'] = df_num.Colors.str.replace(r'(\d+[.\d]*)','')
df_num['animals_text'] = df_num.Animals.str.replace(r'(\d+[.\d]*)','')


    Colors  Animals colors_num  animals_num colors_str  animals_text
0   lila1.5 hu11nd  1.5 11  lila    hund
1   rosa2.5 12welpe 2.5 12  rosa    welpe
2   gelb3.5 13katze 3.5 13  gelb    katze
3   grün4   s14chlange  4   14  grün    schlange
4   rot5    vo15gel 5   15  rot vogel
5   schwarz6    16papagei   6   16  schwarz papagei
6   grau7   ku17h   7   17  grau    kuh
7   weiß8   18ziege 8   18  weiß    ziege
8   braun9  19pferd 9   19  braun   pferd
9   hellblau10  esel20  10  20  hellblau    esel
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
2

You could use (\d+\.\d+|\d+) to extract your numbers, and replace the results with "" to get your string.

print (df_num.assign(colors_num=df_num["Colors"].str.extract(r"(\d+\.\d+|\d+)"))
             .assign(colors_col=df_num["Colors"].str.replace(r"(\d+\.\d+|\d+)","")))

       Colors     Animals colors_num colors_col
0     lila1.5      hu11nd        1.5       lila
1     rosa2.5     12welpe        2.5       rosa
2     gelb3.5     13katze        3.5       gelb
3       grün4  s14chlange          4       grün
4        rot5     vo15gel          5        rot
5    schwarz6   16papagei          6    schwarz
6       grau7       ku17h          7       grau
7       weiß8     18ziege          8       weiß
8      braun9     19pferd          9      braun
9  hellblau10      esel20         10   hellblau
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
2

The easiest way to go about this is to define some functions:

def text(x):
    return x.str.replace(r'[0-9.]+','')
def values(x):
    return x.str.extract(r'([0-9.]+)', expand = False)

df_str.transform([text,values])

          Colors          Animals       
       text values      text values
0      lila    1.5      hund     11
1      rosa    2.5     welpe     12
2      gelb      3     katze     13
3      grün      4  schlange     14
4       rot      5     vogel     15
5   schwarz      6   papagei     16
6      grau      7       kuh     17
7      weiß      8     ziege     18
8     braun      9     pferd     19
9  hellblau     10      esel     20
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

You may take advantage of built-in str methods isnumeric() or isalpha() instead of regular expressions. See below:

# get rid of letters and handle floating points
>>> "".join([c for c in "word234with23numbers" if c.isnumeric() or c == "."])
"23423"

>>> "".join([c for c in "gelb3.5" if c.isnumeric() or c == "."])
"3.5"

# get rid of numbers
>>> "".join([c for c in "word234with23numbers" if c.isalpha()])
"wordwithnumbers"
arnaud
  • 3,293
  • 1
  • 10
  • 27