I am trying to do topic modeling on a set of articles. The raw data contains a lot of backslashes and numbers, so I cleaned it first. However, even after removing punctuation, backslashes, and numbers, a backslash followed by numbers still appears among the top terms of topic 1. The preprocessing code I used is:

library(tm)

# Convert to lower case
articles <- tm_map(articles, content_transformer(tolower))
# Remove numbers
articles <- tm_map(articles, removeNumbers)
# Remove common English stopwords
articles <- tm_map(articles, removeWords, stopwords("english"))
# Remove punctuation
articles <- tm_map(articles, removePunctuation)
# Eliminate extra white space
articles <- tm_map(articles, stripWhitespace)
# Replace literal backslashes with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
articles <- tm_map(articles, toSpace, "\\\\")

Even after cleaning the data, backslashes and numbers still appear among the top terms of the topics:

design
robot
class
medical
device wkh\003
students
dcbl
ri\003
course

The backslashes and numbers in the topics clearly do not belong there. How can I remove them?

  • Have a look at creating reproducible examples as well for future questions: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – william3031 Aug 24 '21 at 00:34

1 Answer

You can use the stringr package. For example:

library(tidyverse)

df <- tibble(text = c("robot", "class", "medical", "device wkh\\003", "students", "dcbl", "ri\\003", "course", NA))


df %>% 
  mutate(text = str_remove_all(text, "\\\\"))
  
# A tibble: 9 × 1
  text         
  <chr>        
1 robot        
2 class        
3 medical      
4 device wkh003
5 students     
6 dcbl         
7 ri003        
8 course       
9 NA  
william3031
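
One caveat, offered as a guess because the question does not include the raw data: when R prints a string, it shows control characters as octal escapes. A term printed as wkh\003 may therefore contain a single control character (code 3) rather than a literal backslash followed by digits. If that is the case, neither removeNumbers nor a backslash pattern would match it. A minimal base R sketch with made-up sample terms follows; the name removeCntrl in the commented-out tm lines is hypothetical.

```r
# Hypothetical sample terms. In an R string literal, "\003" is ONE
# control character (code 3), not a backslash followed by three digits.
terms <- c("robot", "wkh\003", "ri\003", "course")

# The escape counts as a single character
nchar(terms[2])
# [1] 4

# A literal-backslash pattern has nothing to match here
no_change <- gsub("\\\\", " ", terms)
identical(no_change, terms)
# [1] TRUE

# The POSIX class [[:cntrl:]] matches control characters
cleaned <- gsub("[[:cntrl:]]", "", terms)
print(cleaned)
# [1] "robot"  "wkh"    "ri"     "course"

# Assumed usage with the tm corpus from the question (not run here):
# removeCntrl <- content_transformer(function(x) gsub("[[:cntrl:]]", " ", x))
# articles <- tm_map(articles, removeCntrl)
```

If this guess is right, the transformer would be applied before stripWhitespace so the inserted spaces get collapsed. Leftover fragments such as wkh and ri would still remain and might need a custom stopword list or a check of how the text was extracted.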