I have a dataframe, and I want to get the weight of every word in each sentence from a DTM (document-term matrix) or TDM (term-document matrix). From those weights I want to get the maximum weight together with the word that carries it, and then apply a calculation to each word's weight.

My dataframe is given below:

       text                                
 1.   miralisitin manzoorpashteen     
 2.   She is best of best.                     
 3.   Try again and again.                     
 4.   Beware of this woman. She is bad woman.
 5.   Hold! hold and hold it tight.  

I want it to be like:

       text                                 wordweight    maxword   maxcount
1.  miralisitin manzoorpashteen                 1 1         NA        NA
2.  She is best of best.                      1 1 2 1       best       2
3.  Try again and again.                       1 2 1         again     2
4.  Beware of this woman. She is bad woman.  1 1 1 2 1 1 1   woman     2
5.  Hold! hold and hold it tight.             3 1 1 1         hold     3

How can I do this?

I have tried this with the quanteda library but did not get the result, as its dfm() function works on a corpus, not on a dataframe. It can also be done with the tm library's DTM or TDM, but not in this form.

M--

1 Answer

The solution below gives the frequency table of words in each sentence. You should be able to post-process it to get what you need.

library(stringr)

df <- structure(list(text = structure(c(3L, 4L, 5L, 1L, 2L), 
                           .Label = c("Beware of this woman. She is bad woman.", 
                            "Hold! hold and hold it tight.", "miralisitin manzoorpashteen", 
                            "She is best of best.", "Try again and again."), 
                class = "factor")), class = "data.frame", row.names = c(NA, -5L)) 

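# For each sentence: replace non-alphanumerics with spaces, collapse and trim
# whitespace, split on spaces, lowercase, then tabulate word frequencies.
lapply(df$text, function(x) {table(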
                              tolower(
                               unlist(
                                strsplit(
                                 gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "",
                                      as.character(str_replace_all(x, "[^[:alnum:]]", " ")), 
                                      perl=TRUE),
                                          " "))))})
#> [[1]]
#> manzoorpashteen     miralisitin 
#>               1               1 
#> 
#> [[2]]
#> best   is   of  she 
#>    2    1    1    1 
#> 
#> [[3]]
#> again   and   try 
#>     2     1     1 
#> 
#> [[4]]
#>    bad beware     is     of    she   this  woman 
#>      1      1      1      1      1      1      2 
#> 
#> [[5]]
#>   and  hold    it tight 
#>     1     3     1     1

Created on 2019-05-01 by the reprex package (v0.2.1)
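
For the post-processing step, here is a minimal base R sketch that turns the per-sentence counts into the three columns shown in the question. It is one possible approach, not the only one. The helper name `count_words` is hypothetical, and the rule that `maxword` and `maxcount` are `NA` when no word repeats is an assumption inferred from row 1 of the desired output. Unlike the `table()` call above, which sorts words alphabetically, this sketch keeps words in order of first appearance so that `wordweight` lines up with the sentence.

```r
# Example data from the question, stored as character rather than factor.
df <- data.frame(
  text = c("miralisitin manzoorpashteen",
           "She is best of best.",
           "Try again and again.",
           "Beware of this woman. She is bad woman.",
           "Hold! hold and hold it tight."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the words of one sentence,
# keeping them in order of first appearance.
count_words <- function(sentence) {
  # Lowercase and turn every run of non-alphanumerics into a single space.
  cleaned <- gsub("[^[:alnum:]]+", " ", tolower(sentence))
  words <- strsplit(trimws(cleaned), " +")[[1]]
  words <- words[nzchar(words)]
  # Fixing the factor levels stops table() from sorting alphabetically.
  table(factor(words, levels = unique(words)))
}

counts <- lapply(as.character(df$text), count_words)

# Space-separated counts, one per distinct word.
df$wordweight <- vapply(
  counts,
  function(tab) paste(as.integer(tab), collapse = " "),
  character(1)
)

# Assumption: when no word occurs more than once, report NA.
# Ties are resolved in favour of the word that appears first.
df$maxword <- vapply(counts, function(tab) {
  if (length(tab) == 0 || max(tab) < 2) NA_character_
  else names(tab)[which.max(tab)]
}, character(1))

df$maxcount <- vapply(counts, function(tab) {
  if (length(tab) == 0 || max(tab) < 2) NA_integer_
  else as.integer(max(tab))
}, integer(1))

print(df)
```

Because `wordweight` is stored as a string, any later per-word calculation is easier to run on the `counts` list itself, for example `lapply(counts, function(tab) tab / sum(tab))` for relative frequencies.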

M--
  • But I want the output in a dataframe. And what about getting the maximum weight out of these, along with the word, so I can do some calculation afterwards? @M-M – Mahnoor Akmal May 01 '19 at 18:18
  • Well, that is post-processing. This is a genuine comment and not intended to condemn. You are supposed to show some effort, as the community does not write code for free. We are here to help, so you should try something and ask a specific question rather than asking *write me code that does this*. Moreover, you are supposed to ask one question per post. So take this answer, try something, and if you are still struggling, post another question about the next steps. Cheers. @MahnoorAkmal – M-- May 01 '19 at 18:25
  • Well, as I mentioned above, I have tried it but am not getting this output. I end up with numerous columns in my dataframe after combining my DTM with it. So I asked here how to get this output. – Mahnoor Akmal May 01 '19 at 18:29
  • @MahnoorAkmal You should've provided a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Take the [tour](https://stackoverflow.com/tour) and read [How to Ask](https://stackoverflow.com/help/how-to-ask) to understand that your question is out of scope for this community. You should show your attempts with a [Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve). What I provided tells you how to get the frequencies. You may take it and try to make it work for you, or just say thanks and move on. – M-- May 01 '19 at 18:33