-2

I have a pandas DataFrame column containing a value like this:

"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"

I need to extract the part in the last pair of brackets, i.e. my resulting value should be AS7878788.

I am doing the below:

newdf = pd.DataFrame(df.COLUMNNAME.str.split('(', n=1).tolist(), columns=['col1', 'col2'])
df['newcol'] = newdf['col2'].str[:10]

For the value above, this gives the output "12tytyttyt" in the new column, but my intended output is "AS7878788".

Can someone help please?

sayan sen
    Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jul 10 '18 at 09:10

2 Answers

1

Let's try first with a regular string in pure Python:

x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"

res = x.rsplit('(', 1)[-1][:-1]  # 'AS7878788'

Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
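To make the direction of the split concrete, here is a small sketch comparing `split` and `rsplit` on the same string. The intermediate variable names are only illustrative.

```python
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"

# split works from the left, so the first "(" is the cut point
left_parts = x.split('(', 1)

# rsplit works from the right, so the last "(" is the cut point
right_parts = x.rsplit('(', 1)

last_chunk = right_parts[-1]  # text after the final "(", still ending in ")"
res = last_chunk[:-1]         # drop the trailing ")"

print(left_parts)
print(right_parts)
print(res)
```

Splitting from the left leaves everything after the first bracket in one piece, which is why taking a slice of that piece returns the first bracketed value rather than the last.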

You can then apply this in Pandas via pd.Series.str methods:

df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]

Here's a demo:

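import pandas as pd

df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})

# n is passed by keyword; recent pandas versions no longer accept it positionally
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]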

print(df)

         col
0  AS7878788

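Note that the solution above is specific to strings shaped like your example: it assumes the closing bracket is the very last character. If anything follows the final closing bracket, the result will still contain that bracket. For a more flexible alternative, consider using regex.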
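As a rough illustration of that limitation, the sketch below runs the same approach on a second string that has text after the final closing bracket, next to a variant that cuts at the first ")" instead of slicing off one character. The variant and the names `strict` and `tolerant` are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

s = pd.Series(["assdffjhjhjh(12tytyt)bhhh(AS7878788)",
               "asjhgdv(abjhsgf)(abjsdfvhg)afdsgf"])

# Approach from the answer: assumes ")" is the very last character
strict = s.str.rsplit('(', n=1).str[-1].str[:-1]

# Variant (an assumption, not from the answer): keep only the text
# before the first ")" that follows the last "("
tolerant = s.str.rsplit('(', n=1).str[-1].str.split(')', n=1).str[0]

print(strict.tolist())
print(tolerant.tolist())
```

Both versions still rely on the string containing at least one "(". A value with no brackets passes through unchanged rather than becoming missing, so a regex is likely the safer choice for messy data.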

jpp
  • You could do it with a regex too - something like `df['col'].str.findall(r'\(([^\(^\)]+)\)').str[-1]` as a brute-force first try – asongtoruin Jul 10 '18 at 09:16
  • @asongtoruin, Good point! Unfortunately, I don't know regex, so I would encourage you to post as a separate answer :). Also, I *believe* standard Python string methods are more efficient than regex. I'm not sure if this is also reflected in Pandas performance. – jpp Jul 10 '18 at 09:17
  • I think you're right about standard methods being more efficient, but regex gives you more flexibility. For example, your answer would return a closing bracket if there was any character after the final closing bracket – asongtoruin Jul 10 '18 at 09:22
  • @asongtoruin, Yup, entirely agree with that too. I'll add a disclaimer – jpp Jul 10 '18 at 09:23
1

You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:

import pandas as pd

df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
                           'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})

and we do:

df['col'] = df['col'].str.findall(r'\(([^()]+)\)').str[-1]

this gets us:

         col
0  AS7878788
1  abjsdfvhg

To explain what the regex is doing, it is trying to find all instances where we have:

\(          # an open bracket
([^()]+)    # one or more characters that are neither an open nor a close bracket (captured)
\)          # a close bracket

We can see how this works if we drop the .str[-1] from the end of the previous statement: applied to the original data, df['col'] = df['col'].str.findall(r'\(([^()]+)\)') gives us:

                    col
0  [12tytyt, AS7878788]
1  [abjhsgf, abjsdfvhg]
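If only the last bracketed value is needed, one possible alternative is `str.extract` with a greedy prefix, which skips building the intermediate lists. This is a sketch rather than part of the approach above, and the third row is a made-up example added to show what happens when a value has no brackets.

```python
import pandas as pd

df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
                           'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf',
                           'no brackets here']})  # hypothetical extra row

# The greedy ".*" consumes as much as it can, so the capture group
# lands on the last bracketed run in each string
last_group = df['col'].str.extract(r'.*\(([^()]+)\)', expand=False)

print(last_group)
```

Rows without a bracketed value come back as NaN, which matches what the `findall(...).str[-1]` version does for an empty list of matches.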
asongtoruin