I would like to remove the characters between [] (brackets included). Currently I am doing:

df['Text'] = df['Text'].str.replace(r"\[.*\]","")

But the output isn't what I want. Before, the text is `[image] This document`; after, it is `******* This document`, where `*` is whitespace.

How do I get rid of this whitespace?

Edit 1

The Text column of df looks like below:

ID    Text
0     REAL ESTATE LEASE THIS INDUSTRIAL REAL ESTAT...
5     Lease AureementMade and signed on the \ of Aug...
6     FIRST AMENDMENT OF LEASEDATE: August 31, 2001L...
8     [image: image0.jpg] Jack[image: image1.jb2] ...
9     [image: image0.jpg] ABC SALES Meeting 97...
14    FIRST AMENDMENT OF LEASETHIS FIRST AMENDMENT O...
17    [image: image0.tif] Deep ML LEASE SERVI...
22    [image: image0.jpg] F 15 083 EX [image: image1...
26    LEASE AGREEMENT—GROSS LEASEBASIC LEASE PROVISI...
28    [image: image0.jpg] 17. Medical VERIFICATION...
31    [image: image0.jpg]  [image: image1.jb2] PLL 3...
32    SUBLEASETHIS SUBLEASE this “Sublease” made as ...
34    [image: image0.tif] Lease Agreement May 10, 20...
35    13057968.3  1 Initials:  _____  _____  SECOND ...
42    [image: image0.jpg] Jack Dowson Buy Real MI...
46     Deep – Machine Learning LEASE   B...

I would like to see

ID    Text
0     REAL ESTATE LEASE THIS INDUSTRIAL REAL ESTAT...
5     Lease AureementMade and signed on the \ of Aug...
6     FIRST AMENDMENT OF LEASEDATE: August 31, 2001L...
8     Jack ...
9     ABC SALES Meeting 97...
14    FIRST AMENDMENT OF LEASETHIS FIRST AMENDMENT O...
17    Deep ML LEASE SERVI...
22    F 15 083 EX ...
26    LEASE AGREEMENT—GROSS LEASEBASIC LEASE PROVISI...
28    17. Medical VERIFICATION...
31    PLL 3...
32    SUBLEASETHIS SUBLEASE this “Sublease” made as ...
34    Lease Agreement May 10, 20...
35    13057968.3  1 Initials:  _____  _____  SECOND ...
42    Jack Dowson Buy Real MI...
46    Deep – Machine Learning LEASE   B...
chintan s
  • Please take the time to read this post on [how to provide a great pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](https://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on how to ask a good question may also be useful. – yatu Jul 03 '19 at 12:52
  • `df['Text'] = df['Text'].str.replace(r"\[.*\]","").str.strip()`? – Rakesh Jul 03 '19 at 12:55
  • If I use @Rakesh's solution, it removes the entire row. – chintan s Jul 03 '19 at 13:00

2 Answers

Looks like you need .str.strip()

Ex:

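import pandas as pd
df = pd.DataFrame({"ID": [1,2,3], "Text": ["[image: 123.jpg] This document", "[image: image.jpg] Readers of the article", "The agreement between [image: image.jpg] two parties"]})
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df["Text"] = df["Text"].str.replace(r"(\s*\[.*?\]\s*)", " ", regex=True).str.strip()
print(df["Text"])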

Output:

0                        This document
1               Readers of the article
2    The agreement between two parties
Name: Text, dtype: object
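
To see why the pattern above also consumes the whitespace around each tag, here is a minimal sketch using Python's standard `re` module instead of pandas. The helper names `naive_clean` and `clean` are made up for illustration and are not part of the answer. Stripping only trims the two ends of the string, so a tag in the middle of the text leaves two adjacent spaces unless the pattern swallows them.

```python
import re

# Matches only the bracketed tag itself.
TAG = re.compile(r"\[.*?\]")
# Matches the tag together with any whitespace on either side of it.
TAG_WITH_SPACE = re.compile(r"\s*\[.*?\]\s*")


def naive_clean(text: str) -> str:
    """Delete each tag, then trim only the ends of the string."""
    return TAG.sub("", text).strip()


def clean(text: str) -> str:
    """Replace each tag and its surrounding whitespace with one space, then trim."""
    return TAG_WITH_SPACE.sub(" ", text).strip()


sample = "The agreement between [image: image.jpg] two parties"
print(repr(naive_clean(sample)))  # the space on each side of the tag survives
print(repr(clean(sample)))        # the gap collapses to a single space
```

On a DataFrame column, a function like this could be applied with `df["Text"].map(clean)`, which should behave like the `str.replace(...).str.strip()` chain above.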
Rakesh
  • Note that there are **two** spaces between words *between* and *two*, so this proposition doesn't do the job. *str.strip()* removes leading and trailing spaces from the **whole** text, not before/after each match. – Valdi_Bo Jul 03 '19 at 13:08
  • @Valdi_Bo Thanks, I did not see that. – Rakesh Jul 03 '19 at 13:11

Add an optional space (a space followed by `?`) to the end of your regex, so the whole regex (the match part) becomes:

r'\[.*\] ?'

Another hint: Your regex is enclosed in parentheses (a capturing group). They are not needed. Remove them.
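
As a rough sketch of how this pattern behaves, the snippet below uses the standard `re` module and compares the pattern as written with a lazy variant (`.*?`). The lazy variant is an assumption added here, not part of the answer: the question's sample rows contain several tags each, and a greedy `.*` matches from the first `[` to the last `]`. The helper name `drop_tags` and the sample row are hypothetical.

```python
import re

GREEDY = re.compile(r"\[.*\] ?")   # the pattern as written above
LAZY = re.compile(r"\[.*?\] ?")    # assumption: lazy variant for rows with several tags


def drop_tags(text: str, pattern: re.Pattern = LAZY) -> str:
    """Remove every bracketed tag plus one optional trailing space."""
    return pattern.sub("", text)


# Hypothetical row with two tags, loosely modelled on the question's data.
row = "[image: a.jpg] Jack [image: b.jpg] Dowson"
print(repr(drop_tags(row, GREEDY)))  # greedy match also removes the text between the tags
print(repr(drop_tags(row)))          # lazy match removes each tag separately
```

In pandas the same idea would look like `df['Text'].str.replace(r'\[.*?\] ?', '', regex=True)`; `regex=True` has to be passed explicitly on pandas 2.0 and later, where the default is literal matching.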

Valdi_Bo