I have this code to remove every rt (retweet marker) from a given data series. However, it does not work: I still see rt everywhere.

def pre_process(text):

    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

The second line of the function is the one I find most problematic.

I tried:

newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'rt', "")

but it removed every rt, including those inside words, turning deport into depo and part into pa.

Thanks a lot.

I have a sample screenshot of the data (image not reproduced here).

Apologies for the delayed upload of a sample file: https://docs.google.com/spreadsheets/d/1CQm-fXdGGrCw6JJzbm9u1sgK4Nc6QTDIrWuiL2b2ZGI/edit?usp=sharing

As you can see in the file, I picked different patterns such as:

RT:
RT@
RT 
RT inside the sentence

I also made sure that there are some words like depo rt, pa rt, a rt icle and others to visualise my problem correctly.

Thank you very much.

Mtrinidad
  • @anky My bad... I didn't pick up on the `dataseries` in the question. Comment removed and voted to reopen – Nick Apr 26 '20 at 04:11
  • `newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'(?i)(rt)','')` seems to work for me. not sure why you need the word boundary , you should include an example to show what doesn't work, also may be you can take a look at `stemming` , just guessing – anky Apr 26 '20 at 04:13
  • @anky thanks! but this also removed all rt not just the standalone word 'rt'. It removed the rt in part, rt in deport, rt in sort. I will post some example. – Mtrinidad Apr 26 '20 at 04:17
  • then may be try `.str.replace(r'(?i)\brt\b','')` – anky Apr 26 '20 at 04:19
  • It might help to post some sample text that is failing (not a picture of it). – Todd Apr 26 '20 at 04:47
  • They're all failing. They all have the characters 'rt' one way or the other. – Mtrinidad Apr 26 '20 at 04:52
  • The input text... sample text so we can try to reproduce this.. – Todd Apr 26 '20 at 05:09
  • Sorry for misunderstanding! I will do it now – Mtrinidad Apr 26 '20 at 05:39
  • According to the documentation the first parameter needs to be a _compiled_ regex to work as a regex - the `r'` syntax is just for easier typing of raw strings - it doesn't compile a regex. Could you try `re.compile(r'\brt\b')` and give that to `replace` instead of your regex directly? – MatsLindh Apr 26 '20 at 06:58
  • There is clearly something else other than what you posted. The issue is not repro. Probably, you run the replacement on one `df`, but then display some other `df`. – Wiktor Stribiżew Apr 26 '20 at 09:31
  • Add the sample text as actual text instead of an image so we can reproduce this ... – compuphys Apr 26 '20 at 10:00

2 Answers

Suggestion

1.) Rewrite your user function like this:

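def pre_process(s):
    s = s.str.lower()
    # regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
    s = s.str.replace(r'\brt\b', "", regex=True)
    s = s.replace(r'@\w+', '', regex=True)
    s = s.replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

    return s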

2.) Call the user function with the DataFrame series you want to process as the parameter object:

newdataset['tidytext'] = pre_process(newdataset['text'])

Returns:

ap my troops arrest ro suspects 6 buddhists killed httpapnewszqzoyhz 
my troops arrest ro suspects six buddhists killed accused httpnewspaperstread111326479 august 05 2017 at 0652pm ussupportll
my govnt probe finds no campaign of abuse against ro httpowlymdqb50dhfmk
my rejects allegations of human rights abuses against ro httpreutrs2wwuepg httptwittercomreutersstatus894153592306884608
this is part of a bigger problem we don’t need to deport them
north of ny is a good place to move into
this article is very sensationalist
you cant just  all of my tweetssome are part of a bigger storye18
calls for aearly morning prayer please 

Proof

After reviewing your sampletweets, I think the issue is how you are calling the methods, not an issue with regex.
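
To back up the point that the regex itself is fine, here is a minimal sketch using only Python's built-in re module. The sample strings are made up for illustration (they are not taken from your file), and the re.IGNORECASE flag is an assumption on my part so that the check also works before lowercasing:

```python
import re

# Whole-word "rt", case-insensitive; \b marks a word boundary.
RT_WORD = re.compile(r"\brt\b", flags=re.IGNORECASE)

# Hypothetical sample strings covering the patterns listed in the question.
samples = [
    "RT: troops arrest suspects",
    "RT@user breaking news",
    "this is part of a bigger problem, no need to deport them",
    "please RT this article",
]

for text in samples:
    # Only the standalone token is removed; "part", "deport" and "article" survive.
    print(repr(RT_WORD.sub("", text)))

# For contrast, the pattern without word boundaries also eats into words.
print(repr(re.sub(r"rt", "", "deport part")))
```

Note that removing the token leaves the surrounding whitespace and punctuation behind, so a later clean-up step is still needed.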

The user function pre_process(text) never uses its parameter text. All of its internal method calls act on the DataFrame column newdataset['text'] directly, whatever you pass in.

By user function I mean the code you shared:

def pre_process(text):

    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

By internal method calls I mean:

    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

Because the input variable is never connected to these internal actions, passing a Series to the function does not change what gets processed, and the function returns nothing that could be assigned to a column.

One way to fix this is to rewrite the actions within the user function so that they act on the variable passed to it (as suggested above):

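def pre_process(s):
    s = s.str.lower()
    # regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
    s = s.str.replace(r'\brt\b', "", regex=True)
    s = s.replace(r'@\w+', '', regex=True)
    s = s.replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)

    return s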

With this user function, the internal method calls act on the variable s, so when a DataFrame Series is passed to pre_process, the calls act on that Series.

Make sure the function hands the processed Series back to the caller by ending it with the line return s.

Now we can assign the function's return value to a new column (i.e. Series) that holds the processed text:

newdataset['tidytext'] = pre_process(newdataset['text'])
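
For a quick end-to-end check, here is a small self-contained sketch of the same idea. The demo DataFrame and its two rows are invented for illustration. The punctuation step is left out for brevity, and the final whitespace clean-up is an extra step I added that is not part of the function above:

```python
import pandas as pd

def pre_process(s: pd.Series) -> pd.Series:
    """Return a cleaned copy of s; the caller's Series is left unchanged."""
    s = s.str.lower()
    s = s.str.replace(r"\brt\b", "", regex=True)  # standalone "rt" only
    s = s.str.replace(r"@\w+", "", regex=True)    # @mentions
    # Extra step (an assumption, not in the answer): collapse leftover whitespace.
    s = s.str.replace(r"\s+", " ", regex=True).str.strip()
    return s

# Hypothetical two-row frame standing in for newdataset.
demo = pd.DataFrame({"text": ["RT @user we do not deport", "Part of the story"]})
demo["tidytext"] = pre_process(demo["text"])

print(demo["tidytext"].tolist())
```

Because the function works on its argument and returns the result, the original text column stays untouched and the same function can be reused on any other Series.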

Hope it helps!

jameshollisandrew

The regex should be compiled before being passed to replace if you're using a pandas version before 0.23.0 (from 0.23.0 on, it depends on the regex parameter of replace).

The replace method used here is a str method specific to pandas Series; it is not a direct match for Python's native str.replace method.
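
As a rough illustration of that difference on the plain-Python side, here is a sketch with a made-up string: the built-in str.replace always treats its first argument as literal text, while re.sub interprets it as a regular expression.

```python
import re

text = "rt part of the report"  # hypothetical sample string

# Built-in str.replace: the pattern is literal characters, so the
# backslash-b sequences are searched for verbatim and nothing matches.
literal_boundary = text.replace(r"\brt\b", "")

# A literal "rt" matches everywhere, including inside words.
literal_plain = text.replace("rt", "")

# re.sub: the pattern is a regular expression, so \b is a word boundary.
regex_boundary = re.sub(r"\brt\b", "", text)

print(repr(literal_boundary))
print(repr(literal_plain))
print(repr(regex_boundary))
```

Whether the pandas method behaves like the first or the last case depends on how the pattern is interpreted, which is why the regex handling matters here.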

I recommend compiling the regex outside of the function so you can re-use the compiled regex across invocations:

import re

rt_regex = re.compile(r'\brt\b')

def pre_process(text):
    newdataset['tidytext'] = newdataset['text'].str.lower()
    # regex=True is needed on pandas >= 2.0 when passing a compiled pattern
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(rt_regex, "", regex=True)
MatsLindh
  • See the docs you referred to: *"pat : **str** or compiled regex String **can be a character sequence** or regular expression."* – Wiktor Stribiżew Apr 26 '20 at 09:30
  • This will depend on the version; for any version of pandas prior to `0.23.0`, the user has to explicitly compile the regex themselves first. From 0.23.0 it is compiled as a regex by pandas as long as the `regex` parameter is True. – MatsLindh Apr 26 '20 at 10:01
  • That is still not explaining why `r'\brt\b'`, which is actually `\brt\b` string, still affects words like `deport`. It is clear that OP code cannot result in the results described, be it a compiled regex or not. Hence, your answer is not relevant for the *current question*. – Wiktor Stribiżew Apr 26 '20 at 10:04
  • There is nothing in the question that says that `\brt\b` affects `deport`. The case where it does, is where the replacement pattern is _only_ `rt`, which seems to indicate it's not being parsed as a regex. If OP can expand and add a test case I'll delete the answer if it answers the wrong question. – MatsLindh Apr 26 '20 at 11:24
  • See "*but it removed all rt, making deport - depo and part - pa*" – Wiktor Stribiżew Apr 26 '20 at 11:25
  • Yes, which corresponds to the method call `.str.replace(r'rt', "")`. No word boundary marks there. – MatsLindh Apr 26 '20 at 11:26
  • Correct, but OP used `r'\brt\b'`, so that output is not due to the code OP posted. We do not know what OP has used in fact, we do not see the rest of the code. – Wiktor Stribiżew Apr 26 '20 at 11:27