Replace '-' by 'E-' in dataframe cell IF the '-' is in the middle of a string

Question

I have a huge dataframe composed of 7 columns. Extract:

45589   664865.0    100000.0    7.62275    -.494     1.60149      100010
...
57205   718888.0    100000.0    8.218463    -1.405-3     1.75137      100010
...
55143   711827.0    100000.0    8.156107    9.8336-3    1.758051      100010

As these values come from an input file, there are currently all of string type and I would like to change all the dataframe to float through :

df= df.astype('float')

However, as you might have noticed on the extract, there are ' - ' hiding. Some represent the negative value of the whole number, such as -.494 and others represent a negative power, such as 9.8-3.

I need to replace the latters with 'E-' so Python understands it's a power and can switch the cell to a float type. Usually, I would use:

df= df.replace('E\-', '-', regex=True)

However, this would also add an E to my negative values. To avoid that, I tried the solution offered here: Replace all a in the middle of string by * using regex

str = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
str = re.sub(r'\BA+\B', r'*', str)

However, this is for one specific string. As my dataframe is quite large, I would like to avoid having to go through each cell.

Is there a combination of the functions replace and re.sub I could use in order to only replace the'-' surrounded by other strings by 'E-'?

Thank you for your help!

SeaBean · Accepted Answer · 2021-06-28T14:04:44.450

2

You can use regex negative lookahead and positive lookahead to assert that the hyphen is in the middle for replace, as follows:

df = df.replace(r'\s', '', regex=True)      # remove any unwanted spaces 
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)

Result:

print(df)

        0         1         2         3          4         5       6
0  45589  664865.0  100000.0   7.62275      -.494   1.60149  100010
1  57205  718888.0  100000.0  8.218463  -1.405E-3   1.75137  100010
2  55143  711827.0  100000.0  8.156107  9.8336E-3  1.758051  100010

edited Jun 28 '21 at 14:04

answered Jun 28 '21 at 13:07

SeaBean

22,547
3
13
25

Hi ! Thanks for your answer. However, it did add a few 'E-' to my negative values so I get the error: "ValueError: could not convert string to float: ' E-.05613'". The weird part is that it partially worked though... Thanks for your input nonetheless – elle.delle Jun 28 '21 at 13:58
@elle.delle Would there be any leading spaces before the first hypen ? – SeaBean Jun 28 '21 at 14:01
1

@elle.delle See my edit to perform cleaning up of unwanted spaces before execution. – SeaBean Jun 28 '21 at 14:05

daveydave · Answer 2 · 2021-06-28T13:09:27.683

Regular expressions can be expensive, perhaps slice the string into the first digit and remaining digits, use replace on the remaining digits, then recombine with the first digit. Haven't benchmarked this though! Something like this (applied with df.str_col.apply(lambda x: f(x))

my_str = '-1.23-4'
def f(x):
  first_part = my_str[0]
  remaining_part = my_str[1:]
  remaining_part = remaining_part.replace('-', 'E-')
  return first_part + remaining_part

Or as a one liner (assuming the seven columns are the only columns in your df, otherwise specify the columns):

df.apply(lambda x: x[0] + x[1:].replace('-', 'E-'))

A bit more complex than @Osamoele 's answer, but works too, thanks a lot for your help :) — elle.delle, Jun 28 '21 at 14:07

Osamoele · Answer 3 · 2021-09-17T20:58:37.703

1

You could use groups as specified in this thread, to select the number before you exponent so that :

first : the match only ocurs when the minus is preceded by values
and second : replace the match by E preceded by the values matched by the group (for example 158-3 will be replaced "dynamically" by the value 158 matched in group 1, with the expression \1 (group 1 content) and "statically" by E-.

This gives :

df.replace({r'(\d+)-' : r'\1E-'}, inplace=True, regex=True)

(You can verify it on regexp tester)

edited Sep 17 '21 at 20:58

answered Jun 28 '21 at 13:08

Osamoele

371
4
15

1

This doesn't work, but this works: `df.replace({r'(\d)-' : r'\1E-'}, inplace=True, regex=True)` – bruno-uy Jun 28 '21 at 13:14
1

Thanks, I modified it in the answer. – Osamoele Jun 28 '21 at 13:18
1

Worked like a charm, it's possibly the easiest and most elegant answer too, thanks a lot! – elle.delle Jun 28 '21 at 14:06
2

@elle.delle This also matches `-` at the end of strings, instead of at the middle. Use with care! – SeaBean Jun 28 '21 at 14:14
1

@SeaBean Ooh thanks for the tip, I'll be careful about that! – elle.delle Jun 28 '21 at 14:20

score 1 · Answer 4 · answered Jun 28 '21 at 13:10

1

I tried this example and worked:

import pandas as pd

df = pd.DataFrame({'A': ['-.494', '-1.405-3', '9.8336-3']})
pat = r"(\d)-"
repl = lambda m: f"{m.group(1)}e-"
df['A'] = df['A'].str.replace(pat, repl, regex=True)
df['A'] = pd.to_numeric(df['A'], errors='coerce')

answered Jun 28 '21 at 13:10

bruno-uy

1,647
12
20

This works, thanks a lot! The only downside is that it "just" does column by column but that is still a major improvement then from cell to cell. Thanks for your input, I really appreciate it! – elle.delle Jun 28 '21 at 14:04

Replace '-' by 'E-' in dataframe cell IF the '-' is in the middle of a string

4 Answers4