6

I am trying to use re.split() to split a single variable in a pandas dataframe into two other variables.

My data looks like:

   xg              
0.05+0.43
0.93+0.05
0.00
0.11+0.11
0.00
3.94-2.06

I want to create

 e      a
0.05  0.43
0.93  0.05
0.00  
0.11  0.11
0.00
3.94  2.06

I can do this using a for loop and indexing.

for i in range(len(df)):
    if df['xg'].str.len()[i] < 5:
        df['e'][i] = df['xg'][i]
    else:
        df['e'][i], df['a'][i] = re.split("[\+ \-]", df['xg'][i])

However, this is slow, and I do not believe it is a good way of doing this. I am trying to improve my code and my understanding of Python.

I have made various attempts to write it using np.where, a list comprehension, or apply with a lambda, but I can't get it to run. I think all the issues I have are because I am trying to apply the functions to the whole series rather than to the value at each position.

If anyone has an idea of a better method than my ugly for loop I would be very interested.

U13-Forward
oldlizard
    Possible duplicate of [how to split column of tuples in pandas dataframe?](https://stackoverflow.com/questions/29550414/how-to-split-column-of-tuples-in-pandas-dataframe) – Matthieu Brucher Nov 20 '18 at 22:04

3 Answers

4

Borrowed from this answer using the str.split method with the expand argument: https://stackoverflow.com/a/14745484/3084939

import pandas as pd

df = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
# Split on '+' or '-'; expand=True returns one column per piece
df[['left','right']] = df['col'].str.split('[+-]', expand=True)

df.head()
       col left right
0      1+2    1     2
1      3+4    3     4
2       20   20  None
3  0.6-1.6  0.6   1.6
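
Applied to the xg column from the question, the same idea might look like the sketch below. The e and a names come from the question; the n=1 limit and the pd.to_numeric conversion are additions here (assumptions, not part of the linked answer), made so the result is numeric rather than strings.

```python
import pandas as pd

df = pd.DataFrame({'xg': ['0.05+0.43', '0.93+0.05', '0.00',
                          '0.11+0.11', '0.00', '3.94-2.06']})

# Split on the first '+' or '-' only (n=1), so at most two pieces come back
parts = df['xg'].str.split(r'[+-]', n=1, expand=True)

# Assumption: numeric columns are wanted, so convert the string pieces;
# rows without a delimiter end up as NaN in 'a'
df['e'] = pd.to_numeric(parts[0])
df['a'] = pd.to_numeric(parts[1])

print(df)
```

Rows such as 0.00 have no delimiter, so their a value is missing rather than an empty string.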
wonderstruck80
  • This is a much better method than the loop, I thought you could only split on a single delimiter. Thanks! – oldlizard Nov 20 '18 at 22:37
  • ```
           col left right
    0      1+2    1     2
    1      3+4    3     4
    2       20   20  None
    3  0.6-1.6  0.6   1.6
    ```
    retention of sign is missing... – Subham Mar 24 '21 at 07:31
0

This may be what you want. I'm not sure it's elegant, but it should be faster than an explicit Python loop.

import pandas as pd
import numpy as np

data = ['0.05+0.43','0.93+0.05','0.00','0.11+0.11','0.00','3.94-2.06']
df = pd.DataFrame(data, columns=['xg'])

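# Solution: split each value on a space, '-' or '+' into a list of parts
tmp = df['xg'].str.split(r'[ \-+]')
df['e'] = tmp.apply(lambda x: x[0])  # the first part always exists
df['a'] = tmp.apply(lambda x: x[1] if len(x) > 1 else np.nan)  # NaN when there was no delimiter
del tmp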
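
As a possible variation (a sketch, not a benchmarked claim), the two apply calls could be swapped for the .str indexer, which also works on a Series of lists and yields NaN when a position is out of range. The name parts is just illustrative.

```python
import pandas as pd

data = ['0.05+0.43', '0.93+0.05', '0.00', '0.11+0.11', '0.00', '3.94-2.06']
df = pd.DataFrame(data, columns=['xg'])

# Each element of `parts` is a list such as ['0.05', '0.43'] or ['0.00']
parts = df['xg'].str.split(r'[ \-+]')

# .str[i] indexes into each list; out-of-range positions become NaN
df['e'] = parts.str[0]
df['a'] = parts.str[1]

print(df)
```

This avoids the explicit length check and the numpy import, at the cost of being a little less obvious to read.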
AResem
0

Regex to retain the negative (-ve) sign:

import pandas as pd
import re

df1 = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
# Each row: the original string followed by every (optionally negative) number found in it
data = [[i] + re.findall(r'-?[0-9.]+', i) for i in df1['col']]

df = pd.DataFrame(data, columns=["col", "left", "right"])

print(df.head())
       col left right
0      1+2    1     2
1      3+4    3     4
2       20   20  None
3  0.6-1.6  0.6  -1.6
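
The same sign-retaining idea can probably be expressed without a Python-level list comprehension by using str.extract with named groups. This is a sketch under the assumption that each value holds at most two numbers; the pattern below is illustrative and is not taken from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'col': ['1+2', '3+4', '20', '0.6-1.6']})

# left:  an optionally negative number
# \+?:   a '+' separator is consumed; a '-' is left in place so it
#        becomes the sign of `right`
# right: an optional second number, NaN when absent
pattern = r'(?P<left>-?[0-9.]+)\+?(?P<right>-?[0-9.]+)?'

# extract returns one column per named group, aligned on the index
df = df.join(df['col'].str.extract(pattern))
print(df)
```

The extracted values are still strings; pd.to_numeric can be applied to each column if numbers are needed.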
Subham