
I want to remove a whole tweet (that is, a row) from a data frame if it contains any non-English word. My data frame looks like this:

     text
1  | morning why didnt i go to sleep earlier oh well im seEING DNP TODAY!!  
     JIP UHH <f0><U+009F><U+0092><U+0096><f0><U+009F><U+0092><U+0096>

2  | @natefrancis00 @SimplyAJ10 <f0><U+009F><U+0098><U+0086><f0><U+009F 
     <U+0086> if only Alan had a Twitter hahaha

3  | @pchirsch23 @The_0nceler @livetennis Whoa whoa let’s not take this too 
     far now
4  | @pchirsch23 @The_0nceler @livetennis Well Pat that’s just not true
5  | One word #Shame on you! #Ji allowing looters to become president

The expected dataframe should be like this:

 text
3  | @pchirsch23 @The_0nceler @livetennis Whoa whoa let’s not take this too 
     far now
4  | @pchirsch23 @The_0nceler @livetennis Well Pat that’s just not true
5  | One word #Shame on you! #Ji allowing looters to become president.
lmo
Mahnoor

1 Answer


You want to keep the alphanumeric characters along with some punctuation such as @ and !.
If the unwanted content in your column is mainly <unicode> escape sequences, then this should do:

For a data frame df with a text column, using grep:

new_str <- grep(pattern = "<[^>]+>", x = df$text, value = TRUE, invert = TRUE)  # texts with no <...> escape
new_str[new_str != ""]  # drop empty strings

To put the result back into your original text column, you can work with the indices of the rows you need and set the others to NA:

has_esc <- grepl("<[^>]+>", df$text)  # TRUE where the text contains a <...> escape
df$text[has_esc] <- NA  # logical indexing also behaves when no row is clean
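
Since the goal in the question is to drop the offending rows entirely rather than blank them out, a logical index from grepl can also subset the data frame directly. The following is a minimal sketch, not a definitive implementation: the toy data frame, the name esc_pattern, and the stricter pattern (which assumes the escapes always look like <f0> or <U+009F>) are assumptions made for illustration.

```r
# Toy data frame modelled on the question; the contents are illustrative only.
df <- data.frame(
  text = c(
    "morning why didnt i go to sleep earlier <f0><U+009F><U+0092><U+0096>",
    "@natefrancis00 <f0><U+009F><U+0098><U+0086> if only Alan had a Twitter hahaha",
    "@pchirsch23 @The_0nceler @livetennis Well Pat that's just not true",
    "One word #Shame on you! #Ji allowing looters to become president"
  ),
  stringsAsFactors = FALSE
)

# Assumed escape format: a two-digit hex byte such as <f0>,
# or a code point such as <U+009F>.
esc_pattern <- "<([0-9A-Fa-f]{2}|U\\+[0-9A-Fa-f]{4,8})>"

# TRUE for every row whose text contains at least one escape sequence.
has_escape <- grepl(esc_pattern, df$text)

# Keep only the clean rows; drop = FALSE keeps the result a data frame
# even though it has a single column.
df_clean <- df[!has_escape, , drop = FALSE]
print(df_clean)
```

With this toy data only rows 3 and 4 remain, and the original row names are preserved, so the surviving rows can still be traced back to the full data frame.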

To clean the tweets themselves, you can use the gsub function. Refer to this post: cleaning tweet in R
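
As one example of this kind of cleaning, the follow-up request in the comments (strip the mentions at the start of a tweet but keep those inside the text) could be sketched as below. The sample tweets and the exact pattern are assumptions for illustration, and the pattern treats a mention as @ followed by letters, digits, or underscores.

```r
# Hypothetical sample tweets, loosely based on the rows in the question.
tweets <- c(
  "@pchirsch23 @The_0nceler @livetennis Whoa whoa not too far @someone now",
  "Well @Pat that's just not true",
  "One word #Shame on you!"
)

# ^\s*          optional leading whitespace at the start of the string
# (@\w+\s+)+    one or more "@handle" tokens, each followed by whitespace
# sub() is sufficient because the anchored pattern can match only once.
cleaned <- sub("^\\s*(@\\w+\\s+)+", "", tweets, perl = TRUE)
print(cleaned)
```

Mentions in the middle of a tweet are left alone because the pattern is anchored to the start of the string.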

Mankind_008
  • I want to delete the whole text if any <unicode> is present in it, whether it appears alongside English or foreign-language text. As I mentioned, I want only those rows that are completely free of <unicode>. – Mahnoor Jun 11 '18 at 17:14
  • The solution you gave just removes the <unicode>, which can also be done with a regex, but I want to remove the whole row if the text column contains any Unicode. – Mahnoor Jun 11 '18 at 17:16
  • Also removed the rows with `""`. Check it now. – Mankind_008 Jun 11 '18 at 17:43
  • Wow, it works like a charm :) Thanks, dear. But one problem: how do I replace the old column in the data frame with this? It gives me 48 entries, but the old column has 56. – Mahnoor Jun 11 '18 at 17:53
  • When I convert the column into a corpus and remove the Unicode with a regex, it removes only the Unicode sequences, not the rows. I then convert the corpus back to a data frame and use it to replace the column in the previous data frame; the row counts match because no row is removed, and the column appears under the name text.text. But when I save it to a CSV file, the text column contains combinations of numeric values rather than the text of either the new or the old data frame. – Mahnoor Jun 11 '18 at 18:09
  • So, kindly tell me how to deal with the row removal problem. – Mahnoor Jun 11 '18 at 18:09
  • Then you can just work with the indices of the rows you need and set the others to `NA`. Updated above. – Mankind_008 Jun 11 '18 at 18:16
  • Thanks dude, thanks a ton! You really solved a huge problem for me. Many prayers :) – Mahnoor Jun 11 '18 at 18:49
  • One more thing: can you please tell me how to remove the mentions when they appear at the start of the text? – Mahnoor Jun 11 '18 at 18:49
  • Like the text examples above: from row 3 of text I just want: "Whoa Whoa I just not take this @Mankind_008 too far now." – Mahnoor Jun 11 '18 at 18:53
  • That is, removing the leading mentions but keeping the mentions in the middle of the text. – Mahnoor Jun 11 '18 at 18:53
  • You can use the `gsub` function. I added the link to the post; refer to that, it's quite helpful. – Mankind_008 Jun 11 '18 at 19:13
  • That's great! And sorry for bothering you once again, but please guide me on this. My dataset is now left with a collection of tweets, some related to my topic and some not. How can I get rid of the unrelated tweets? I could do it manually, but that is too tedious. Kindly help! – Mahnoor Jun 11 '18 at 19:33