
I have a dataframe:

A            B
10.1        33.3
11.2        44.2s
12.3        11.3s
14.2s       *
15.4s       nan

I want the output as:

A            B
10.1        33.3
11.2        44.2
12.3        11.3
14.2          0
15.4         0

How do I remove these trailing alphabetic characters?

I have tried this code:

1st approach:

bulb_temp_df['A'].str.extract('(\d)').astype(float)
bulb_temp_df['B'].str.extract('(\d)').astype(float)

2nd approach:

bulb_temp_df['A'] = bulb_temp_df['A'].astype(str)
bulb_temp_df['A'] = bulb_temp_df['A'].map(lambda x: x.rstrip('aAbBcC'))

These are not working. They are not removing the trailing 's' from the columns.

3rd approach:

bulb_temp_df[cols] = bulb_temp_df[cols].apply(lambda x: x.str.extract(r'(\d+\.\d+)', expand=False)
                                .astype(float)
                                .fillna(0))

None of these are working. The last one removes the trailing 's', but it converts the values without an 's' to zero or NaN.

neilfws
PriyalChaudhari
  • Try stripping the data after using `rstrip`. Example: `x.rstrip('aAbBcC').strip()` – Mauro Baraldi Jun 22 '17 at 23:55
  • Thought this was an r question so if you know how to use that you can just do this `x2<-gsub("[^0-9.]", "", as.character(unlist(x)));x2[x2==""]<-0;x3<-matrix(as.numeric(x2),5,2);colnames(x3)<-c("A","B");x3`. – CCurtis Jun 22 '17 at 23:58

3 Answers


First, I'm going to create a reproducible example:

from io import StringIO
import re
import numpy as np
import pandas as pd

s = StringIO('''\
       A           B
1   10.1        33.3
2   11.2       44.2s
3   12.3       11.3s
4   14.2s          *
5   15.4s        nan
''')


df = pd.read_table(s, sep=r'\s\s+', engine='python')
df['A'] = df['A'].astype(str)
df['B'] = df['B'].astype(str)

Now, you can use regex and re.sub:

df = (df.applymap(lambda x: re.sub(r'[^0-9.\-]+', '', x))
        .replace('', np.float64(0))
        .astype('float64'))

print(df)

Output:

      A     B
1  10.1  33.3
2  11.2  44.2
3  12.3  11.3
4  14.2   0.0
5  15.4   0.0
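
A shorter variant of the same idea (a sketch, assuming the columns are already strings as above): strip everything that isn't a digit, dot, or minus sign with the vectorized `.str.replace`, then let `pd.to_numeric` handle the conversion, filling the now-empty cells with 0.

```python
import pandas as pd

df = pd.DataFrame({'A': ['10.1', '11.2', '12.3', '14.2s', '15.4s'],
                   'B': ['33.3', '44.2s', '11.3s', '*', 'nan']})

# Remove every character that is not a digit, dot, or minus sign,
# then coerce to numbers; fully-stripped cells become NaN, which we fill with 0.
cleaned = df.apply(lambda col: pd.to_numeric(
    col.str.replace(r'[^0-9.\-]', '', regex=True), errors='coerce').fillna(0))

print(cleaned)
```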
FJSevilla

This isn't super clean, but it works!

import re
import pandas as pd

def cleanup(df, column):
    # String representation of the whole column (index, values, dtype footer)
    temp = str(df[column])
    # Capturing group matching floats, '*' or 'NaN'; with a group, re.split
    # keeps the matches themselves at the odd indices of the result
    vals = re.split(r"(\d+\.\d+|\*|NaN)", temp)[1::2]
    # Map '*' and 'NaN' to 0 and cast everything to float
    out = [float(a.replace('*', '0').replace('NaN', '0')) for a in vals]

    return out

bulb_temp_df = pd.DataFrame(data=list(zip(cleanup(bulb_temp_df, 'A'),
                                          cleanup(bulb_temp_df, 'B'))),
                            columns=['A', 'B'])

What this does is build a new pandas dataframe with the formatted data. It formats the data by searching for the regex (\d+\.\d+|\*|NaN), which finds all floats, *'s, or NaN's. Using this regex, we split the column (converted to a string) wherever one of the three patterns occurs. Then we take every odd index, since that holds what we're interested in.

Now we have a list of strings: the floats without the 's', plus the *'s and NaN's. The list comprehension replaces all the *'s and NaN's with 0 and casts every element to float (since they are still strings).
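
To see why the [1::2] slice picks out the values, here is a small sketch of re.split with a capturing group: the captured matches land at the odd indices of the result, with the surrounding filler (index labels, whitespace, dtype footer) at the even ones. The `temp` string below is a hypothetical stand-in for a Series' string representation.

```python
import re

# Stand-in for str(df[column]): index labels, values, and a dtype footer
temp = "1    10.1\n2    44.2s\n3    *\nName: B, dtype: object"

parts = re.split(r"(\d+\.\d+|\*|NaN)", temp)
# Even indices hold the filler text; odd indices hold the actual matches
print(parts[1::2])
```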

Not elegant, but it works!

jacoblaw

I found this to be a simple approach (as answered here):
use replace to retain only the digits (plus the dot and minus sign).
This removes any character, alphabetic or otherwise, that is not allowed by the to_replace pattern.

So, the solution is:

df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'')
df['A1'] = df['A1'].astype(float)
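
Put together on sample data (a sketch; the column name A1 is just illustrative). One caveat: a cell like '*' or 'nan' is stripped down to an empty string, which can't be cast to float directly, so the sketch maps empty strings to '0' first. It also uses plain assignment rather than inplace=True, which behaves the same here.

```python
import pandas as pd

df = pd.DataFrame({'A1': ['10.1', '11.2s', '*', 'nan']})

# Keep only digits, dots, and minus signs; everything else is replaced with ''.
df['A1'] = df['A1'].replace(to_replace=r'[^0-9.\-]', value='', regex=True)

# Fully-stripped cells are now empty strings; map them to '0' before casting.
df['A1'] = df['A1'].replace('', '0').astype(float)

print(df['A1'].tolist())
```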

CuriousCoder