I want to remove punctuation, numbers and http links from the text in a data.frame. I tried the tm, stringr, quanteda and tidytext packages, but none of them worked. I'm looking for a simple package or function that cleans a data.frame directly, without converting it to a corpus or something similar.

How can I do it?

mycorpus <- tm_map(mycorpus, content_transformer(remove_url))
Warning message: In tm_map.SimpleCorpus(mycorpus, content_transformer(remove_url)) : transformation drops documents

mycorpus <- tm_map(mycorpus, removePunctuation)
Warning message: In tm_map.SimpleCorpus(mycorpus, removePunctuation) : transformation drops documents

And when I try to view tweets that contain any symbol:
Error in nchar(output) : invalid multibyte string, element 1

mycorpus <- tm_map(mycorpus, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input

Fatih Bayrak
  • What *exactly* have you tried? Please [see here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R post we can help with. That includes a representative sample of data, code that hasn't worked, and expected output. – camille Jul 29 '18 at 16:43
  • Welcome to SO. It is always recommended to post samples of input and expected output in your post, with code tags. – RavinderSingh13 Jul 29 '18 at 16:50
  • Please provide a shortened example of your data we can work with. Otherwise we have to keep guessing. – Manuel Bickel Jul 29 '18 at 20:06
  • You might take another look at unnest_tokens from tidytext, which now has a token = "tweets" option that may be a good fit for you. It has options including strip_punct = TRUE and strip_url = TRUE. – Julia Silge Aug 15 '18 at 00:21

2 Answers

Since you haven't posted any sample input or expected output, I couldn't test this. To remove punctuation, digits and http links from a specific column of your data frame, you could try the following.

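# remove whole URLs anywhere in the string, then punctuation and digits; assign the result back to the column
df$column <- gsub("https?://[^[:space:]]*|[[:punct:]]|[[:digit:]]", "", df$column)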

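Or, as per Rui's suggestion in the comments, you could use the following, which strips only the URL scheme (http://, https://) along with punctuation and digits, and keeps the rest of the address as plain letters.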

gsub("[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)","",df$column)
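A minimal sketch of the same idea applied in steps to a whole column, so that URLs are dropped before their punctuation is stripped. The data frame `tweets`, its column `text`, and the helper `clean_text` are hypothetical names made up for illustration; they are not from the question.

```r
# Hypothetical sample data; `tweets` and `text` are illustrative names only.
tweets <- data.frame(
  id   = 1:3,
  text = c("Check this out: https://example.com/a?b=1 !!!",
           "R 3.5 is out, see http://example.org",
           "No links here, just 2 numbers: 10 and 20."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: base R only, no corpus conversion needed.
clean_text <- function(x) {
  x <- gsub("https?://[^[:space:]]+", "", x)   # drop whole URLs first
  x <- gsub("[[:punct:]]|[[:digit:]]", "", x)  # then punctuation and digits
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)                                    # trim leading/trailing blanks
}

# Assign back to the column instead of printing the result to the console.
tweets$text <- clean_text(tweets$text)
head(tweets)
```

Removing URLs in a separate first pass avoids leftovers such as "examplecom" that appear when punctuation is stripped before the link is recognised.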
Rui Barradas
RavinderSingh13

A more concise version is possible if you only want to keep letters: replace everything that is not a letter. I assume you want to replace each match with a blank, since you mentioned a corpus; otherwise your addresses will collapse into one long string (but maybe that is what you want; as stated, an example would help).

x = c("https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r"
      , "http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r")

gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)
# [1] "    stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r"
# [2] "    stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r"

# the same task for a data frame of 100000 rows
# make sure that your strings are not factors
df = data.frame(id = 1:1e5, url = rep(x, 1e5/2), stringsAsFactors = FALSE)
# df before replacement
df[1:4, ]
# id    url
# 1  1 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 2  2  http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 3  3 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 4  4  http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# apply replacement on a specific column and assign result back to this column
df$url = gsub("\\W|\\d|http\\w?", " ", df$url, perl = TRUE)
# check output
df[1:4, ]
# id        url
# 1  1     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 2  2     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 3  3     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 4  4     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
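The "invalid multibyte string" and "invalid input" errors quoted in the question typically mean that some strings contain bytes that are not valid in the current encoding. The sketch below is one possible way to deal with that before running `gsub`, assuming the text is meant to be UTF-8; the vector `raw_text` and its contents are invented for illustration, and `iconv` behaviour can differ between platforms.

```r
# Hypothetical input: the second element contains a stray Latin-1 byte (\xe9),
# which is not valid UTF-8 on its own.
raw_text <- c("plain ascii tweet 123", "caf\xe9 latte, 2 for 1!")

# Re-encode as UTF-8 and drop any bytes that cannot be converted.
# (Assumption: the data are supposed to be UTF-8; use from = "latin1"
# instead if that is the real source encoding.)
fixed <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "")

# Check that every string is now valid before applying the replacement.
stopifnot(all(validUTF8(fixed)))

# Same replacement as above, now on cleaned input.
cleaned <- gsub("\\W|\\d", " ", fixed, perl = TRUE)
```

Dropping the offending bytes loses those characters; if they matter, converting from the real source encoding is the safer choice.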
Manuel Bickel
  • I cannot do it because my data has 86909 rows. When I use gsub, R tries to print all the converted data to the console, like # [1] ... ... ..., and the program crashes. So I need a solution that removes all punctuation in the data.frame itself. – Fatih Bayrak Jul 29 '18 at 23:19
  • Updated my answer to show how to apply the replacements to a data.frame of 100000 rows; this only takes seconds. – Manuel Bickel Jul 30 '18 at 10:30