1

I need advice, with an explanation, on how to split a string.

I have this in a dataframe column:

data
(0,1), (1,2)

And I would like to split it into this form:

1 2
(0,1) (1,2)

How to split this string correctly?

When I use this:

.str.split(',', expand=True)

it also splits on the commas inside the parentheses, which I don't want. How can I do this correctly (with an explanation, please)?

SeaBean
Cesc
  • Added an explanation to the solution using `str.split()`. This solution tweaks your code so that it splits only on the comma between tuples, not on the comma within a tuple. – SeaBean Aug 02 '21 at 07:28
  • 1
    @SeaBean Yes, you're right. For the solution, I had to tweak my data a bit to make it work. It's true that I was asking about a dataframe. I modified the solution label. – Cesc Aug 02 '21 at 07:28

7 Answers

2

You can use str.extract() with regex, as follows:

df['data'].str.extract(r'(\(\d+,\s*\d+\))\s*,\s*(\(\d+,\s*\d+\))')

or use str.split(), as follows:

df['data'].str.split(r'(?<=\))\s*,\s*', expand=True)

Here the regex uses a positive lookbehind, (?<=\)), which requires a closing parenthesis ) immediately before the separator for the comma to match. Hence we split only on the comma between tuples and not on the commas within them.

Result:

       0      1
0  (0,1)  (1,2)
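
A minimal, self-contained sketch of both approaches, assuming each row holds exactly two integer tuples. The second row is hypothetical sample data added here to show that uneven spacing is tolerated:

```python
import pandas as pd

# First row is from the question; the second row is made up for illustration
df = pd.DataFrame({"data": ["(0,1), (1,2)", "(3, 4),(5,6)"]})

# str.extract: each capture group becomes one output column
extracted = df["data"].str.extract(r"(\(\d+,\s*\d+\))\s*,\s*(\(\d+,\s*\d+\))")

# str.split: split only on a comma that directly follows a closing parenthesis
split = df["data"].str.split(r"(?<=\))\s*,\s*", expand=True)

print(extracted)
print(split)
```

Both calls should give the same two columns here. The str.extract pattern is stricter: a row that does not match it yields NaN in every column, whereas str.split leaves such a row unsplit.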
SeaBean
1

You can use eval.

tuple_str = "(0,1), (1,2)"
my_tuple = eval(tuple_str)
print(my_tuple)  # ((0, 1), (1, 2))

Read more about eval here.
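
Note that eval executes whatever code the string contains, so it is risky on data you do not control. A possible alternative, not part of the original answer, is ast.literal_eval from the standard library, which accepts only Python literals. A minimal sketch, assuming the string always looks like a tuple of tuples:

```python
import ast

tuple_str = "(0,1), (1,2)"

# literal_eval parses Python literals only (numbers, strings, tuples, lists, ...)
my_tuple = ast.literal_eval(tuple_str)
print(my_tuple)  # ((0, 1), (1, 2))

# Anything that is not a plain literal, such as a function call, is refused
try:
    ast.literal_eval("print('hello')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```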

Sajad
1

You can try this:

import pandas as pd

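df = pd.DataFrame({"data": ['(0,1), (1,2)']})
print(df)
print()

# Split each string on ", "; here that sequence occurs only between the tuples
new_df = pd.DataFrame(df.data.str.split(", ").tolist())
print(new_df)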
"""
           data
0  (0,1), (1,2)

       0      1
0  (0,1)  (1,2)
"""

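We split the "data" column on ", " (a comma followed by a space), convert the result to a list of lists, and build a new DataFrame from it. This works only because the commas inside the tuples have no space after them.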
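
This approach depends on the spacing of the input. A small sketch using plain Python strings (the second sample string is hypothetical) shows what happens when the spacing differs:

```python
consistent = "(0,1), (1,2)"
inconsistent = "(0, 1),(1, 2)"  # hypothetical input with different spacing

# ", " appears only between the tuples, so the split is clean
parts_ok = consistent.split(", ")
print(parts_ok)   # ['(0,1)', '(1,2)']

# ", " now appears inside the tuples, so they are cut apart
parts_bad = inconsistent.split(", ")
print(parts_bad)  # ['(0', '1),(1', '2)']
```

If the spacing cannot be guaranteed, one of the regex-based answers is likely a safer choice.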

imxitiz
1

This also uses a regex, like the other answers, but with re.split:

import re

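s = '(0,1), (1,2),(3,4)'  # avoid naming the variable str, which shadows the built-in
# Split on a comma (with optional spaces) that follows ")" and precedes "("
re.split(r'(?<=\)) *, *(?=\()', s)  # ['(0,1)', '(1,2)', '(3,4)']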

Like str.split, re.split splits a string, but it uses a regex as the delimiter. The re.split documentation can be found here: https://docs.python.org/3/library/re.html#re.split

The regex I use comes from this answer: Regular Expression to find a string included between two characters while EXCLUDING the delimiters

datlt
1

Use the regex \(\d+,\s*\d+\) to match two comma-separated numbers enclosed in parentheses. Pass this regex to str.findall, then apply pd.Series. This creates new columns from the values that match the pattern.

df['data'].str.findall(r'\(\d+,\s*\d+\)').apply(pd.Series)
       0      1
0  (0,1)  (1,2)
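
A self-contained sketch of the same idea, using a hypothetical second row with only one tuple to show how rows of different lengths are handled:

```python
import pandas as pd

# Hypothetical sample: rows with different numbers of tuples
df = pd.DataFrame({"data": ["(0,1), (1,2)", "(3,4)"]})

# findall returns one list of matches per row
matches = df["data"].str.findall(r"\(\d+,\s*\d+\)")

# apply(pd.Series) spreads each list across columns; shorter rows are padded with NaN
wide = matches.apply(pd.Series)
print(wide)
```

Because findall collects every match, this variant does not need to know in advance how many tuples a row contains.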
ThePyGuy
0

You can try a regex:

import re
r=re.findall(r'\(\d+,\d+\)','(0,1),(1,2)')
print(r) # ['(0,1)', '(1,2)']

re.findall finds all non-overlapping substrings matching the regex (first argument) within the haystack (second argument).

The regex given matches a pair of parentheses () containing two numbers (\d+) separated by a comma.

Or, if you want a more flexible version, swap out the second line with:

r=re.findall(r'\(.*?\)','(0,1),(1,2)')

The .*? means match any number of characters, but as few as possible (a non-greedy match).
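
To make the difference between the patterns concrete, here is a small sketch on a hypothetical input that mixes numeric and non-numeric pairs. The greedy variant is added only for contrast:

```python
import re

text = "(0,1), (a, b), (1,2)"  # hypothetical input

# Strict pattern: only pairs of integers with no spaces
strict = re.findall(r"\(\d+,\d+\)", text)
print(strict)  # ['(0,1)', '(1,2)']

# Non-greedy pattern: anything between "(" and the nearest ")"
loose = re.findall(r"\(.*?\)", text)
print(loose)   # ['(0,1)', '(a, b)', '(1,2)']

# Greedy variant: ".*" runs on to the last ")" in the string
greedy = re.findall(r"\(.*\)", text)
print(greedy)  # ['(0,1), (a, b), (1,2)']
```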

xkcdjerry
0

You can use the following regex with Series.str.split:

import pandas as pd
df = pd.DataFrame({'data': ['(0,1), (1,2)']})
df2 = df['data'].str.split(r'\s*,\s*(?![^()]*\))', expand=True)

Output of df2:

       0      1
0  (0,1)  (1,2)

See the regex demo. Details:

  • \s*,\s* - a comma surrounded by zero or more whitespace characters
  • (?![^()]*\)) - a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more chars other than ( and ) and then a ) char.
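
The two parts described above can be checked outside pandas with the standard re module. This is a sketch on a hypothetical string with uneven spacing, assuming the parentheses are never nested:

```python
import re

# Same pattern as above: a comma that is NOT followed by ")" before any "(" or ")"
pattern = re.compile(r"\s*,\s*(?![^()]*\))")

text = "(0,1), (1,2) , (3, 4)"  # hypothetical input
parts = pattern.split(text)
print(parts)  # ['(0,1)', '(1,2)', '(3, 4)']
```

A comma inside a tuple is always followed by some non-parenthesis characters and then ), so the lookahead rejects it. A comma between tuples is followed by (, so it is accepted as a split point.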
Wiktor Stribiżew