0

I have a dataset with 100,000 rows and 300 columns

Here is the sample dataset:

    EVENT_DTL
0   8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family

1   8. Background : Engineer / living with his mom marriage_married

How can I remove the white blank between ‘with’ and ‘marriage_virgin’ but leave only one white blank?

Desired outout would be:

        EVENT_DTL
    0   8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
    
    1   8. Background : Engineer / living with his mom marriage_married

2 Answers2

4

You can use pandas.Series.str to replace "\s+" (1 or more whitespace) by a single whitespace.

Try this :

df["EVENT_DTL"]= df["EVENT_DTL"].str.replace("\s+", " ", regex=True)

Output :

print(df)
                                                                                                   EVENT_DTL
0  8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1  8. Background : Engineer / living with his mom marriage_married

If you need to clean up the whole dataframe, use pandas.DataFrame.replace :

df.astype(str).replace("\s+", " ", regex=True, inplace=True)
Timeless
  • 22,580
  • 4
  • 12
  • 30
  • 2
    I always thought `DataFrame.str` methods were replicas of the standard library string methods, so I didn't think to check the api. This looks much cleaner than using `.apply` with a `re.sub`. – Tzane Dec 01 '22 at 10:59
  • Yeah, pandas _built-in_ functions are nice and faster. – Timeless Dec 01 '22 at 11:01
  • brilliant method – Chung Joshua Dec 01 '22 at 12:58
  • my original dataset has `\n` before the numbers `8.` and `9.`. When I run this code, seems like it removes the `\n ` also. Why? – Chung Joshua Dec 01 '22 at 13:25
  • I don't think this is due to the regex pattern but to your data itself. Is it a literal `\n` or a real line break ? Can you give an example and/or a screenshot ? – Timeless Dec 01 '22 at 13:29
1

You can call string methods for a DataFrame column with

df["EVENT_DTL"].str.strip()

but .strip() doesn't work, because it only removed extra characters from the start and end of the string. To remove all duplicate whitespaces you can use regex:

import re
import pandas as pd

d = {"EVENT_DTL": [
    "8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family",
    "8. Background : Engineer / living with his mom marriage_married"
]}
df = pd.DataFrame(d)
pattern = re.compile(" +")
df["EVENT_DTL"] = df["EVENT_DTL"].apply(lambda x: pattern.sub(" ", x))
print(df["EVENT_DTL"][0])
Tzane
  • 2,752
  • 1
  • 10
  • 21