Suggestion
1.) Rewrite your user function like this:
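def pre_process(s):
    s = s.str.lower()
    # regex=True is explicit because pandas 2.0+ treats the pattern as a literal string by default
    s = s.str.replace(r'\brt\b', "", regex=True)
    s = s.replace(r'@\w+', '', regex=True)
    s = s.replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)
    return s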
2.) Call the user function, passing the DataFrame series you want to process as the argument:
newdataset['tidytext'] = pre_process(newdataset['text'])
Returns:
ap my troops arrest ro suspects 6 buddhists killed httpapnewszqzoyhz
my troops arrest ro suspects six buddhists killed accused httpnewspaperstread111326479 august 05 2017 at 0652pm ussupportll
my govnt probe finds no campaign of abuse against ro httpowlymdqb50dhfmk
my rejects allegations of human rights abuses against ro httpreutrs2wwuepg httptwittercomreutersstatus894153592306884608
this is part of a bigger problem we don’t need to deport them
north of ny is a good place to move into
this article is very sensationalist
you cant just all of my tweetssome are part of a bigger storye18
calls for aearly morning prayer please
Proof
After reviewing your sample tweets, I think the issue is how you are calling the methods, not the regex itself.
In the user function pre_process(text), the internal method calls operate on newdataset directly and never touch the parameter text.
By user function I mean the code you shared:
def pre_process(text):
    newdataset['tidytext'] = newdataset['text'].str.lower()
    newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
    newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)
By internal method calls I mean:
newdataset['tidytext'] = newdataset['text'].str.lower()
newdataset['tidytext'] = newdataset['tidytext'].str.replace(r'\brt\b', "")
newdataset['tidytext'] = newdataset['tidytext'].replace(r'@\w+', '', regex=True)
newdataset['tidytext'] = newdataset['tidytext'].replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)
The actions taken within the user function ignore whatever you pass in, because the input variable text is never connected to the internal calls. The function also has no return statement, so the call itself evaluates to None.
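To make that concrete, here is a minimal sketch of the same pattern. The one-row DataFrame and the series named other are made up for illustration; only the shape of the function mirrors the code you shared.

```python
import pandas as pd

# Hypothetical stand-in for your DataFrame.
newdataset = pd.DataFrame({"text": ["RT @user Hello, World!"]})

def pre_process(text):
    # The parameter `text` is never used: the body reaches for the
    # global `newdataset` instead, and nothing is returned.
    newdataset["tidytext"] = newdataset["text"].str.lower()

# Hypothetical series that we try (and fail) to process.
other = pd.Series(["SOMETHING ELSE"])
result = pre_process(other)

print(result)           # None, because the function has no return statement
print(other.tolist())   # unchanged, because the argument was ignored
```

Assigning result to a column would therefore fill it with missing values rather than cleaned text.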
One way to fix this is to rewrite the actions within the user function so they act on the variable passed in (as suggested above):
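def pre_process(s):
    s = s.str.lower()
    # regex=True is explicit because pandas 2.0+ treats the pattern as a literal string by default
    s = s.str.replace(r'\brt\b', "", regex=True)
    s = s.replace(r'@\w+', '', regex=True)
    s = s.replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)
    return s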
With this user function, the internal method calls act on the variable s, so when the DataFrame series object is passed to the user function pre_process, the internal method calls act on it.
Make sure the function hands the processed series back to the caller by ending it with the line return s.
Now we can assign the result of the function call to a new column (i.e. series) of the DataFrame:
newdataset['tidytext'] = pre_process(newdataset['text'])
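If you want to check the whole thing end to end before running it on your real data, something like the following should work. The two sample rows are invented, and regex=True is passed explicitly to str.replace so the pattern is treated as a regex regardless of pandas version.

```python
import pandas as pd

def pre_process(s):
    s = s.str.lower()                                    # normalise case
    s = s.str.replace(r'\brt\b', '', regex=True)         # drop the standalone token "rt"
    s = s.replace(r'@\w+', '', regex=True)               # drop @mentions
    s = s.replace(r'[!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]', '', regex=True)  # drop punctuation
    return s                                             # hand the processed series back

# Invented sample rows, not taken from your data.
newdataset = pd.DataFrame(
    {"text": ["RT @someone: Hello, World!", "No changes needed"]}
)

newdataset["tidytext"] = pre_process(newdataset["text"])
print(newdataset["tidytext"].tolist())
```

Note that removing tokens leaves the surrounding whitespace behind (the first row ends up with leading spaces). If that matters for your use case, chaining .str.strip() onto the result is one option.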
Hope it helps!