
I am reading a csv file to apply NLP, and I am trying to pre-process the data. The data comes from an online forum, so the messages contain quotes. How can I remove them? For example:

a='[b]Re:[/b] 
[quote="xxx"] How can I do that blah blah xxx [/quote]
 Hello xxx, I will tell you how you can do it blah blah blah.'

I want the result to look like this:

a='Hello xxx, I will tell you how you can do it blah blah blah.'

I want a regex that detects [quote=" and deletes everything from there up to and including [/quote]. Is this possible?

I have tried this, but it did not work.

  def quotes(text):
      return re.sub('\[([^\]=]+)(?:=[^\]]+)?\].*?\[\/\\1\]', '', text)

  data['message'] = data['message'].apply(quotes)
nurlubanu
  • re.sub() replaces whatever its pattern matches with something else. Your pattern does not match the given text, which is why it did not work. Use http://regex101.com, switched to Python, to develop a matching pattern. – Patrick Artner Jul 03 '19 at 15:39

2 Answers


Here is a solution which seems to work:

a = '[b]Re:[/b] [quote="xxx"] How can I do that blah blah xxx [/quote] Hello xxx, I will tell you how you can do it blah blah blah.'
output = re.sub(r'\[([^\]=]+)(?:=[^\]]+)?\](.*?)\[\/\1\]', r'\2', a)
print(output)

This prints:

Re:  How can I do that blah blah xxx  Hello xxx, I will tell you how you can do it blah blah blah.

The regex pattern is a bit verbose, but all it does is match each pair of tags, e.g. [quote="xxx"]...[/quote], and replace the whole match with just the content that was inside the tags.

\[([^\]=]+)(?:=[^\]]+)?\]  match an opening tag, and capture the tag name in \1
(.*?)                      match and capture in \2 all the content
\[\/\1\]                   match a closing tag, using the backreference \1

Note that re.sub by default will do a global replacement, so once we have a working pattern for a single set of tags, it can be applied everywhere.
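
As a rough illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The name TAG_PAIR and the sample string are made up for this sketch; only re.compile, re.VERBOSE and the sub method come from the standard library.

```python
import re

# Hypothetical name; the pattern itself is the one from the answer,
# spread over several lines with re.VERBOSE.
TAG_PAIR = re.compile(r"""
    \[([^\]=]+)(?:=[^\]]+)?\]   # opening tag; the tag name is captured in group 1
    (.*?)                       # tag content, captured lazily in group 2
    \[/\1\]                     # closing tag that repeats the name from group 1
""", re.VERBOSE)

sample = '[b]Re:[/b] [quote="xxx"] quoted text [/quote] reply text'

# sub() replaces every non-overlapping match, so both tag pairs
# are unwrapped in a single call.
unwrapped = TAG_PAIR.sub(r"\2", sample)
print(repr(unwrapped))
```

Because the closing tag is matched through the backreference, [b] can only pair with [/b] and [quote=...] only with [/quote].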

Edit:

If you actually want to match and delete the entire tags with their contents, then use this:

output = re.sub(r'\[([^\]=]+)(?:=[^\]]+)?\].*?\[\/\1\]', '', a)
print(output)

This prints:

Hello xxx, I will tell you how you can do it blah blah blah.
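
If a tagged block spans several lines, which forum messages often do, `.` will not cross the line breaks unless the re.DOTALL flag is passed. Here is a minimal sketch under that assumption; the names TAG_BLOCK and strip_tagged_blocks and the sample message are hypothetical.

```python
import re

# Same deletion pattern as above, compiled with re.DOTALL so that
# "." also matches newline characters.
TAG_BLOCK = re.compile(r'\[([^\]=]+)(?:=[^\]]+)?\].*?\[/\1\]', re.DOTALL)

def strip_tagged_blocks(text):
    """Delete every [tag]...[/tag] pair with its content, then trim whitespace."""
    return TAG_BLOCK.sub('', text).strip()

# Made-up message in which the quote runs over two lines.
message = ('[b]Re:[/b] \n'
           '[quote="xxx"] How can I do that\nblah blah xxx [/quote]\n'
           ' Hello xxx, I will tell you.')

cleaned = strip_tagged_blocks(message)
print(cleaned)
```

The final strip() only removes the spaces and line breaks left behind at the start and end of the message; whitespace inside the remaining text is untouched.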
Tim Biegeleisen
  • Thank you for the reply, Tim. It works perfectly on a single string, but I could not adapt it to my code. I read the csv file into **data** and want to apply this regex to every row of the 'message' column. I tried this, but it did not work: for text in data['message']: re.sub('\[([^\]=]+)(?:=[^\]]+)?\].*?\[\/\\1\]', '', text). I also tried this: data['message'] = data['message'].re.sub('\[([^\]=]+)(?:=[^\]]+)?\].*?\[\/\\1\]', '',) and the error says "AttributeError: 'Series' object has no attribute 're'" – nurlubanu Jul 04 '19 at 13:08
  • I can't fix a problem which I can't see, and I don't see a problem here. Maybe edit your question and add information which explains the additional complexity you are having in your code. – Tim Biegeleisen Jul 04 '19 at 13:09
  • I am a new member on Stack Overflow and was pressing Enter at the end of my sentences, so part of my message was missing. I have edited it now. Sorry. – nurlubanu Jul 04 '19 at 13:12
  • Please read [applying regex to a pandas dataframe](https://stackoverflow.com/questions/25292838/applying-regex-to-a-pandas-dataframe). I won't edit my answer, because we are getting too broad now. – Tim Biegeleisen Jul 04 '19 at 13:15
  • Hey Tim, it does not work because I posted the example in the wrong form. There is a line break between [b]Re:[/b], [quote="xxx"] and Hello xxx. I edited the question. – nurlubanu Jul 04 '19 at 14:13

The answer is actually quite simple:

import re

def quotes(text):
    # Remove everything from "[quote" up to the last "quote]" on the same line.
    return re.sub(r'\[quote.+quote\]', '', text)

data['message'] = data['message'].apply(quotes)

Just that.
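
One caveat worth checking against your own data: `.+` is greedy and does not cross line breaks, so two quotes on the same line would take the text between them as well, and a quote that spans several lines would be left in place. Below is a hedged variant, assuming quotes are never nested; the names QUOTE_BLOCK and remove_quotes and the sample strings are made up for illustration.

```python
import re

# Lazy ".*?" stops at the first [/quote]; re.DOTALL lets the match span line breaks.
QUOTE_BLOCK = re.compile(r'\[quote\b.*?\[/quote\]', re.DOTALL)

def remove_quotes(text):
    """Remove every [quote ...]...[/quote] block; assumes quotes are not nested."""
    return QUOTE_BLOCK.sub('', text)

one_line = '[quote="a"] first [/quote] keep this [quote="b"] second [/quote] and this'
multi_line = '[quote="a"] first\nline [/quote] reply'

# The greedy pattern from the answer also removes "keep this".
print(repr(re.sub(r'\[quote.+quote\]', '', one_line)))
# The lazy pattern removes the two quotes separately.
print(repr(remove_quotes(one_line)))
# With re.DOTALL the two-line quote is removed as well.
print(repr(remove_quotes(multi_line)))
```

It can be applied to the column in the same way as before, with data['message'].apply(remove_quotes).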

nurlubanu