45

I have a dataframe where I want to remove all parentheses and stuff inside it.

I checked out : How can I remove text within parentheses with a regex?

Where the answer to remove the data was

re.sub(r'\([^)]*\)', '', filename)

I tried this as well as

re.sub(r'\(.*?\)', '', filename)

However, I got an error: expected a string or buffer

When I tried using the column df['Column Name'] I got no item named 'Column Name'

I checked the dataframe using df.head() and it showed up as a clean table with the column names as what I wanted them to be....however when I use the re expression to remove the (stuff) it isn't recognizing the column name that I have.

I normally use

df['name'].str.replace(" ()","") 

However, I want to remove the parentheses and what is inside....How can I do this using either regex or pandas?

Thanks!

Here is the solution I used...thanks for the help!

All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\(.*\)","")
Community
  • 1
  • 1
Alexis
  • 8,531
  • 5
  • 19
  • 21
  • 1
    Please post `df.columns`. – unutbu Jan 03 '14 at 00:43
  • It would be good if someone try to give an example here, can any1 tell, is this question is about data frame, in which column having values like 141(32), 982(21),9083(1231) so after removing brackets we'll get data like 141,982,9083 ?? So please reply me so that I or more like me can understand this one. – Akshay Singh Jan 08 '21 at 15:24

4 Answers4

77
df['name'].str.replace(r"\(.*\)","")

You can't run re functions directly on pandas objects. You have to loop them for each element inside the object. So Series.str.replace((r"\(.*\)", "") is just syntactic sugar for Series.apply(lambda x: re.sub(r"\(.*\)", "", x)).

dmvianna
  • 15,088
  • 18
  • 77
  • 106
  • 1
    What if I wanted to include a space before and after the parenthesis? – B Furtado Dec 14 '16 at 22:32
  • 3
    @B_Furtado: `" \(.*\) "` or `"\s\(.*\)\s"`, assuming you __always__ have one space either side. If you want to replace any number of spaces either side (including none), use `"\s*\(.*\)\s*"`. You should learn **regex** syntax, it isn't too difficult for simple matches like this. Check https://docs.python.org/3/library/re.html. – dmvianna Dec 14 '16 at 22:39
  • Thanks @dmvianna. I keep learning, then I never use and I forget... Best, – B Furtado Dec 14 '16 at 22:44
  • I ahve been playing a bit with r"\(.*\)" and I do not understand the difference between the later expression and r"\(.\)". In principle the '.' refers to any character inside the parenthesis so why do we need '*' ? – JPV Feb 15 '17 at 10:04
  • 1
    @R_user: '.' Refers to any character, and without another parameter will refer to a single character. '\*' signals the previous character (any) can be found zero to any times. So in 'abc' '.' Will match 'a' (one character), while '.\*' will match 'abc' (all characters). – dmvianna Feb 15 '17 at 17:55
  • What is the 'r' for? It seems to work fine without the r? – Esteban Apr 08 '20 at 16:52
  • It throws warning though - `FutureWarning: The default value of regex will change from True to False in a future version.` – Dev Jul 07 '22 at 12:12
11

If you have multiple (...) substrings in the data you should consider using either

All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\(.*?\)", "", regex=True)

or

All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\([^()]*\)", "", regex=True)

The difference is that .*? is slower and does not match line breaks, and [^()] matches any char but ( and ) and is quite efficient and matches line breaks. The first one will match (...(...) but the second will only match (...).

If you want to normalize all whitespace after removing these substrings, you may consider

All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\s*\([^()]*\)", "", regex=True).str.strip()

The \s*\([^()]*\) regex will match 0+ whitespaces and then the string between parentheses and then str.stip() will get rid of any potential trailing whitespace.

NOTE on regex=True:

Acc. to Pandas 1.2.0 release notes:

The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
6

#removing the unwanted characters

Energy['Country'] = Energy['Country'].str.replace(r" \(.*\)","")

Blockquote

Energy['Country'] = Energy['Country'].str.replace(r"([0-9]+)$","")

this are ways you may also remove the unwanted errors

0

All the answers above seem great; However, the following links provide a better understanding. a) https://docs.python.org/3/howto/regex.html#regex-howto b) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

To summarize, to replace a unwanted character, you have to use the pandas.DataFrame.replace function. For instance to remove [] from a dataframe, one can do the following.

import re
p=re.compile('\[]') %% regular expression for matching [] (see reference (a)
result.replace(to_replace=p,value="",inplace=False,regex=True) %%For a dataframe named result, this way one can replace [] with "". see reference (b)