
I am new to text-mining in R. I want to remove stopwords (i.e. extract keywords) from my data frame's column and put those keywords into a new column.

I tried to make a corpus, but it didn't help me.

df$C3 is what I currently have. I would like to add column df$C4, but I can't get it to work.

df <- structure(list(C3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L, 
       10L, 2L), .Label = c("Are doing good", "For the help", "hello everyone", 
       "hope you all", "I Hope", "I need help", "In life", "It would work", 
       "On Text-Mining", "Thanks"), class = "factor"), C4 = structure(c(2L, 
       4L, 1L, 6L, 3L, 7L, 5L, 9L, 8L, 3L), .Label = c("doing good", 
       "everyone", "help", "hope", "Hope", "life", "Text-Mining", "Thanks", 
       "work"), class = "factor")), .Names = c("C3", "C4"), row.names = c(NA, 
       -10L), class = "data.frame")

head(df)
#               C3          C4
# 1 hello everyone    everyone
# 2   hope you all        hope
# 3 Are doing good  doing good
# 4        In life        life
# 5    I need help        help
# 6 On Text-Mining Text-Mining
Has QUIT--Anony-Mousse
Fraxxx
    Show us what you have done so far. Look at the `tidytext` package and search for it on the net to get inspiration if you haven't started yet. – ekstroem Jul 28 '17 at 10:38

3 Answers

1

This solution uses packages dplyr and tidytext.

library(dplyr)
library(tidytext)

# subset of your dataset
dt = data.frame(C1 = c(108,20, 999, 52, 400),
                C2 = c(1,3,7, 6, 9),
                C3 = c("hello everyone","hope you all","Are doing good","in life","I need help"), stringsAsFactors = F)

# function to combine words (by pasting one next to the other)
f = function(x) { paste(x, collapse = " ") }

dt %>%
  unnest_tokens(word, C3) %>%      # split phrases into words
  filter(!word %in% stop_words$word) %>%   # keep appropriate words
  group_by(C1, C2) %>%             # for each combination of C1 and C2
  summarise(word = f(word)) %>%    # combine multiple words (if there are multiple)
  ungroup()                        # forget the grouping

# # A tibble: 2 x 3
#        C1    C2  word
#      <dbl> <dbl> <chr>
#   1    20     3  hope
#   2    52     6  life

The problem here is that the stop-word list built into tidytext (`stop_words`) also filters out some of the words you want to keep. You therefore need a manual step that lists the words to retain. You can do something like this:

dt %>%
  unnest_tokens(word, C3) %>%      # split phrases into words
  filter(!word %in% stop_words$word | word %in% c("everyone","doing","good")) %>%   # keep appropriate words
  group_by(C1, C2) %>%             # for each combination of C1 and C2
  summarise(word = f(word)) %>%    # combine multiple words (if there are multiple)
  ungroup()                        # forget the grouping

# # A tibble: 4 x 3
#        C1    C2       word
#      <dbl> <dbl>      <chr>
#   1    20     3       hope
#   2    52     6       life
#   3   108     1   everyone
#   4   999     7 doing good
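
If the keywords should end up as a new column next to the original text, with no rows dropped, one option is to join the summarised keywords back onto the data. This is only a sketch under assumptions: `row_id`, `keywords`, `result` and the column name `C4` are names chosen for illustration, and rows whose words are all stop words get `NA` in `C4`.

```r
library(dplyr)
library(tidytext)

dt <- data.frame(
  C1 = c(108, 20, 999, 52, 400),
  C2 = c(1, 3, 7, 6, 9),
  C3 = c("hello everyone", "hope you all", "Are doing good", "in life", "I need help"),
  stringsAsFactors = FALSE
)

# row_id is a helper key (an assumption of this sketch) that identifies each row
keywords <- dt %>%
  mutate(row_id = row_number()) %>%
  unnest_tokens(word, C3) %>%                  # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%       # drop tidytext's stop words
  group_by(row_id) %>%
  summarise(C4 = paste(word, collapse = " "))  # glue remaining words back together

# left_join keeps every original row; rows without keywords get NA in C4
result <- dt %>%
  mutate(row_id = row_number()) %>%
  left_join(keywords, by = "row_id") %>%
  select(-row_id)
```

Grouping by a row key instead of `C1` and `C2` avoids merging rows that happen to share the same values in those columns.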
AntoniosK
  • I am getting an Error: Invalid column specification, though I have selected the right columns. I guess it's because of grouping by the other two columns. I just need to extract keywords from my C3 column and bind them to my old data frame. – Fraxxx Jul 28 '17 at 12:33
  • I'm not sure what your actual dataset looks like. What are the types of your columns? They should be character columns, not factor columns. – AntoniosK Jul 28 '17 at 12:46
  • It's a csv file and I have transformed it into a data frame. The specific column from which I am trying to extract the keywords is called "TITLE". This column basically consists of long sentences, and I am looking to extract the keywords from it. This column does not depend on any other column of the data frame. – Fraxxx Jul 28 '17 at 12:50
  • You should use that column name instead of C3 then. Do you still have C1 and C2? You have to adjust my code based on your column names. If you post a few lines of your dataset I'll be able to do that for you. – AntoniosK Jul 28 '17 at 12:53
  • dt %>% unnest_tokens(word, dt$TITLE) %>% filter(!word %in% stop_words$word) %>% group_by(dt$ID1, dt$ID2) %>% summarise(word = f(word)) %>% ungroup() – Fraxxx Jul 28 '17 at 13:08
  • Try to remove the `dt$` before every column name. The pipe `%>%` of `dplyr` doesn't need that. This will solve the problem. – AntoniosK Jul 28 '17 at 13:25
  • Thanks for the solution. But now it is showing new error "Error in resolve_vars(new_groups, tbl_vars(.data)) : unknown variable to group by : x". – Fraxxx Jul 28 '17 at 13:45
  • How did you get that? It seems that the process is trying to group the data by a variable named x. Can you show me what your code looks like now? – AntoniosK Jul 28 '17 at 14:06
  • Thank you so much AntoniosK. You solved my problem. Actually your code was perfectly fine. I have a very large dataset and, because of that, it was showing irrelevant errors. When I chose the first 50 rows, the code worked perfectly fine. Once again, thanks for the great help. – Fraxxx Jul 28 '17 at 14:12
  • One last question, I am not able to save the " dt " as a Data Frame after filtering it. I am getting all the desired columns as a tibble but I am not able to transform it into a Data Frame. – Fraxxx Jul 28 '17 at 15:47
  • The tibble you get at the end of the process does not affect your initial `dt`. If you want to save your final output as a data frame, you first have to convert it to a data frame and then save it. You can add a `... %>% data.frame -> dt2` at the end of the process. Then you'll have your original `dt` and a new `dt2`, which is your desired output as a data frame. – AntoniosK Jul 28 '17 at 16:22
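
Putting the comments above together, a minimal sketch of the adjusted pipeline could look like this. The column names `ID1`, `ID2` and `TITLE` are taken from the comments, the sample rows are made up for illustration, and `as.data.frame()` is used here in place of `data.frame` for the final conversion.

```r
library(dplyr)
library(tidytext)

# made-up sample rows; only the column names come from the comments
dt <- data.frame(
  ID1   = c(1, 2, 3),
  ID2   = c(10, 20, 30),
  TITLE = c("hope you all", "in life", "On Text-Mining"),
  stringsAsFactors = FALSE   # text columns must be character, not factor
)

dt2 <- dt %>%
  unnest_tokens(word, TITLE) %>%            # bare column names, no dt$ prefix
  filter(!word %in% stop_words$word) %>%
  group_by(ID1, ID2) %>%
  summarise(word = paste(word, collapse = " ")) %>%
  ungroup() %>%
  as.data.frame()                           # tibble -> plain data frame

class(dt2)   # dt itself is left unchanged
```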
0

This is one of the first things I did in R. It may not be the best approach, but you could try something like:

library(stringi)

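 # df and stop are defined under "Example data" below
 df2 <- do.call(rbind, lapply(stop$stop, function(x) {
   hits <- data.frame(c1 = df[, 1], c2 = df[, 2],
                      words = stri_extract(df[, 3], coll = x))
   na.omit(hits)  # keep only rows where stopword x was found
 }))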

Example data:

 df = data.frame(c1 = c(108,20,99), c2 = c(1,3,7), c3 = c("hello everyone", "hope you all", "are doing well"))

 stop = data.frame(stop = c("you", "all"))

Afterwards you can reshape df2 using:

df2 = data.frame(c1 = unique(df2$c1), c2 = unique(df2$c2), words = paste(df2$words, collapse = ','))

Then cbind df and df2
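
For comparison, here is a base R sketch that removes the stopwords from each row rather than extracting them, which is closer to what the question asks for. The helper `remove_stopwords`, the vector `stop_vec` and the column `c4` are hypothetical names used only for this illustration.

```r
df <- data.frame(c1 = c(108, 20, 99), c2 = c(1, 3, 7),
                 c3 = c("hello everyone", "hope you all", "are doing well"),
                 stringsAsFactors = FALSE)
stop_vec <- c("you", "all")

# hypothetical helper: split one string on whitespace, drop the stopwords
# (compared case-insensitively), and paste the remaining words back together
remove_stopwords <- function(text, stopwords) {
  words <- strsplit(text, "\\s+")[[1]]
  kept  <- words[!tolower(words) %in% stopwords]
  paste(kept, collapse = " ")
}

df$c4 <- vapply(df$c3, remove_stopwords, character(1),
                stopwords = stop_vec, USE.NAMES = FALSE)
df$c4
# "hello everyone" "hope" "are doing well"
```

Because the result has one element per input row, it can be assigned directly as a new column and no rbind/cbind step is needed.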

Olivia
0

I would use the tm package. It ships with a small list of English stopwords. You can replace these stopwords with a single space using gsub():

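library(tm)
# pad with spaces and lower-case so every word is surrounded by spaces
prep      <- tolower(paste(" ", df$C3, " "))
# one alternation of space-delimited stopwords
regex_pat <- paste(stopwords("en"), collapse = " | ")
df$C4     <- gsub(regex_pat, " ", prep)
# second pass: adjacent stopwords share a space, so one pass can miss the second
df$C4     <- gsub(regex_pat, " ", df$C4)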
#    C3               C4
# 1  hello everyone   hello everyone  
# 2    hope you all             hope  
# 3  Are doing good             good  
# 4         In life             life  
# 5     I need help        need help 

You can easily add your own stopwords by extending the vector, e.g. c("hello", "othernewword", stopwords("en")).
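
As a sketch of how such an extended list might be applied, the example below uses tm's `removeWords()` instead of the hand-built regular expression. The names `custom_stopwords`, `titles` and `cleaned` are made up for this illustration, and the whitespace clean-up at the end is an extra step that is not part of the code above.

```r
library(tm)

# extended stopword list: custom words plus tm's English list
custom_stopwords <- c("hello", "othernewword", stopwords("en"))

titles <- c("hello everyone", "hope you all", "Are doing good")

# removeWords() deletes whole-word matches; lower-case first because
# the stopword list is lower-case
cleaned <- removeWords(tolower(titles), custom_stopwords)

# collapse the gaps left behind and trim leading/trailing spaces
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```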

KenHBS
  • Thanks for the solution. This works fine with the demo sample data, but on real data (which consists of 33,000 rows) R stops working and I have to restart the session every time. I tried 6-7 times, but it still doesn't work. – Fraxxx Jul 28 '17 at 12:58
  • Sometimes it helps to just wait a bit when it seems as if RStudio has stopped working, instead of restarting. I am not sure what the cause could be, to be honest. Also, maybe you can split it up into digestible parts of, say, 5,000 rows? – KenHBS Jul 28 '17 at 13:02
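
The chunking idea from the last comment could be sketched in base R as follows. `process_in_chunks` is a hypothetical helper, and `toupper` merely stands in for whatever cleaning function is applied to the text. Whether chunking actually prevents the crash on the real data is not known.

```r
# hypothetical helper: apply `fun` to a character vector in chunks of
# `chunk_size` elements and return the results in the original order
process_in_chunks <- function(x, fun, chunk_size = 5000) {
  chunk_id <- ceiling(seq_along(x) / chunk_size)   # 1, 1, ..., 2, 2, ...
  pieces   <- lapply(split(x, chunk_id), fun)
  unlist(pieces, use.names = FALSE)
}

titles <- rep(c("hope you all", "in life"), times = 6)

# toupper is only a placeholder for the real stopword-removal step
result <- process_in_chunks(titles, toupper, chunk_size = 5)
```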