3

I have a huge dataframe composed of 7 columns. Extract:

45589   664865.0    100000.0    7.62275    -.494     1.60149      100010
...
57205   718888.0    100000.0    8.218463    -1.405-3     1.75137      100010
...
55143   711827.0    100000.0    8.156107    9.8336-3    1.758051      100010

As these values come from an input file, there are currently all of string type and I would like to change all the dataframe to float through :

df= df.astype('float')

However, as you might have noticed on the extract, there are ' - ' hiding. Some represent the negative value of the whole number, such as -.494 and others represent a negative power, such as 9.8-3.

I need to replace the latters with 'E-' so Python understands it's a power and can switch the cell to a float type. Usually, I would use:

df= df.replace('E\-', '-', regex=True)

However, this would also add an E to my negative values. To avoid that, I tried the solution offered here: Replace all a in the middle of string by * using regex

str = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
str = re.sub(r'\BA+\B', r'*', str)

However, this is for one specific string. As my dataframe is quite large, I would like to avoid having to go through each cell.

Is there a combination of the functions replace and re.sub I could use in order to only replace the'-' surrounded by other strings by 'E-'?

Thank you for your help!

Osamoele
  • 371
  • 4
  • 15
elle.delle
  • 328
  • 3
  • 15

4 Answers4

2

You can use regex negative lookahead and positive lookahead to assert that the hyphen is in the middle for replace, as follows:

df = df.replace(r'\s', '', regex=True)      # remove any unwanted spaces 
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)

Result:

print(df)

        0         1         2         3          4         5       6
0  45589  664865.0  100000.0   7.62275      -.494   1.60149  100010
1  57205  718888.0  100000.0  8.218463  -1.405E-3   1.75137  100010
2  55143  711827.0  100000.0  8.156107  9.8336E-3  1.758051  100010
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Hi ! Thanks for your answer. However, it did add a few 'E-' to my negative values so I get the error: "ValueError: could not convert string to float: ' E-.05613'". The weird part is that it partially worked though... Thanks for your input nonetheless – elle.delle Jun 28 '21 at 13:58
  • @elle.delle Would there be any leading spaces before the first hypen ? – SeaBean Jun 28 '21 at 14:01
  • 1
    @elle.delle See my edit to perform cleaning up of unwanted spaces before execution. – SeaBean Jun 28 '21 at 14:05
1

Regular expressions can be expensive, perhaps slice the string into the first digit and remaining digits, use replace on the remaining digits, then recombine with the first digit. Haven't benchmarked this though! Something like this (applied with df.str_col.apply(lambda x: f(x))

my_str = '-1.23-4'
def f(x):
  first_part = my_str[0]
  remaining_part = my_str[1:]
  remaining_part = remaining_part.replace('-', 'E-')
  return first_part + remaining_part

Or as a one liner (assuming the seven columns are the only columns in your df, otherwise specify the columns):

df.apply(lambda x: x[0] + x[1:].replace('-', 'E-'))
daveydave
  • 71
  • 4
1

You could use groups as specified in this thread, to select the number before you exponent so that :

  • first : the match only ocurs when the minus is preceded by values
  • and second : replace the match by E preceded by the values matched by the group (for example 158-3 will be replaced "dynamically" by the value 158 matched in group 1, with the expression \1 (group 1 content) and "statically" by E-.

This gives :

df.replace({r'(\d+)-' : r'\1E-'}, inplace=True, regex=True)

(You can verify it on regexp tester)

Osamoele
  • 371
  • 4
  • 15
1

I tried this example and worked:

import pandas as pd

df = pd.DataFrame({'A': ['-.494', '-1.405-3', '9.8336-3']})
pat = r"(\d)-"
repl = lambda m: f"{m.group(1)}e-"
df['A'] = df['A'].str.replace(pat, repl, regex=True)
df['A'] = pd.to_numeric(df['A'], errors='coerce')
bruno-uy
  • 1,647
  • 12
  • 20
  • This works, thanks a lot! The only downside is that it "just" does column by column but that is still a major improvement then from cell to cell. Thanks for your input, I really appreciate it! – elle.delle Jun 28 '21 at 14:04