
I have a data frame with 4 columns. Column 1 consists of IDs, column 2 consists of texts (about 100 words each), and columns 3 and 4 consist of labels.

Now I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves and the columns filled with their frequencies (ranging from 0 to ... per text) in the texts.

I tried some functions of the tm package, but so far the results have been unsatisfactory. Does anyone have an idea how to deal with this problem or where to start? Is there a package that can do the job?

id  texts   label1    label2
Brian Tompsett - 汤莱恩
rdatasculptor
  • Try `install.packages("sos"); library("sos"); findFn("word frequency")` and you'll see quite a few options to dig through. I don't know this field but a quick look suggests the capability probably already exists. – Bryan Hanson Mar 06 '13 at 22:06
  • Thanks Bryan! I will have a look. – rdatasculptor Mar 06 '13 at 22:07
  • Unfortunately, I am not quite sure sos will help me deal with my problem. – rdatasculptor Mar 06 '13 at 22:19
  • I found 185 matches - none of them look promising? Or is it more about integrating what you have already done with one of the existing functions? You might need to loop over your existing data frame and accumulate your answer in some structure, but that's a slightly different question than you originally asked. If that's the case, post a small sample of your data so we can figure out a solution. – Bryan Hanson Mar 06 '13 at 22:21
  • maybe I was too quick with my conclusion. I thought sos doesn't mine in data frames – rdatasculptor Mar 06 '13 at 22:27
  • If you posted some sample data we could provide much better assistance. – Tyler Rinker Mar 06 '13 at 22:27
  • `sos` is a utility package to help one find help pages and the existence of various functions. It's not the solution to your problem. Explore the web pages that are linked from the web page that `findFn` produces. It will take you a while. We'll be around... – Bryan Hanson Mar 06 '13 at 22:29
  • I feel a bit silly, but first i have to find out how to put a table/data frame in this page... I am quite new here. – rdatasculptor Mar 06 '13 at 22:38
  • @user1983395 paste the `head` rows of the data or a `dput(head(YOURDATA))` and then highlight it and click the curly-braces icon. Also, what do you mean by "of the most common words"? Give a specific number (e.g. top 10). And do you mean the most common words across all rows? – Tyler Rinker Mar 06 '13 at 22:39
  • How do you propose to "add the frequencies as an extra column" when there are multiple words per entry and multiple rows that might be contributing to the frequency of a word? – IRTFM Mar 06 '13 at 22:43
  • All possible words have to be retrieved from all rows in total, but the frequencies of each word are the 'scores' per text. – rdatasculptor Mar 06 '13 at 22:48
  • @Dwin the user said multiple column **s** – Tyler Rinker Mar 06 '13 at 22:56
  • By 'most common words' I mean as many word frequencies as possible, but I can imagine that will turn into far too many extra columns. I hope I can decide later which words to keep and which frequency columns to remove from the data frame. – rdatasculptor Mar 06 '13 at 22:59
  • @user1983395 That's not what I've asked for (your edit). We want actual text data to play with not just your column names. I think I know exactly how to help but I'm not going to guess. Please post data. – Tyler Rinker Mar 06 '13 at 23:01
  • Just noticed you're new to Stack Overflow, so you may want to check out [this link suggestion](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making a reproducible example. If you follow it you'll get an answer much more quickly. – Tyler Rinker Mar 06 '13 at 23:14
  • Thanks Tyler, I will have a look! (I knew that was not what you were asking for, but I haven't made it work yet). I have to go now, and hopefully back within a few hours. I will do my best making my question as clear as possible. Kind regards – rdatasculptor Mar 06 '13 at 23:31

1 Answer


Well let's work through the issues then...

I'm guessing you have a data.frame that looks like this:

       person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

This data set comes from the qdap package. To get qdap, use `install.packages("qdap")`.

Now, to make the reproducible example I was talking about, do with your own data set what I'm doing here with the `DATA` data set from qdap:

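library(qdap)     # provides the DATA data set and wfm()
DATA              # print the whole data set
dput(head(DATA))  # copy-pasteable definition of the first six rows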
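For a data frame shaped like the one in the question, the same idea might look like the sketch below. The name `mydata` and all of its values are made up for illustration; only the four column names come from the question.

```r
# Hypothetical stand-in for the asker's data: id, texts, label1, label2.
mydata <- data.frame(
  id     = c(1, 2),
  texts  = c("first short text", "second short text"),
  label1 = c("a", "b"),
  label2 = c("x", "y"),
  stringsAsFactors = FALSE
)

# Print R code that recreates the first rows; paste that output into the question.
dput(head(mydata))
```

Anyone can then assign the pasted `structure(...)` text to a variable and work with the same rows you have.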

OK, now for your original problem: I think `wfm` (word frequency matrix) will do what you want:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))    # transpose so rows = texts, columns = words
data.frame(DATA, freqs, check.names = FALSE) # check.names = FALSE keeps the words as column names

If you only want the top n words, order the columns by total frequency, as I do here:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9]      #top 9 words
top9 <- freqs[, names(ords)]                #grab those columns from freqs  
data.frame(DATA, top9, check.names = FALSE) #put it together

The outcome looks like this:

> data.frame(DATA, top9, check.names = FALSE)
       person sex adult                                 state code you we what not no it's is i fun
1         sam   m     0         Computer is fun. Not too fun.   K1   0  0    0   1  0    0  1 0   2
2        greg   m     0               No it's not, it's dumb.   K2   0  0    0   1  1    2  0 0   0
3     teacher   m     1                    What should we do?   K3   0  1    1   0  0    0  0 0   0
4         sam   m     0                  You liar, it stinks!   K4   1  0    0   0  0    0  0 0   0
5        greg   m     0               I am telling the truth!   K5   0  0    0   0  0    0  0 1   0
6       sally   f     0                How can we be certain?   K6   0  1    0   0  0    0  0 0   0
7        greg   m     0                      There is no way.   K7   0  0    0   0  1    0  1 0   0
8         sam   m     0                       I distrust you.   K8   1  0    0   0  0    0  0 1   0
9       sally   f     0           What are you talking about?   K9   1  0    1   0  0    0  0 0   0
10 researcher   f     1         Shall we move on?  Good then.  K10   0  1    0   0  0    0  0 0   0
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11   1  0    0   0  0    0  0 0   0
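If qdap is not available, roughly the same result can be put together in base R. The following is only a sketch under stated assumptions, not what `wfm` does internally: the toy data frame `df` is hypothetical, and the tokenizer (lowercase, then split on anything that is not a letter or an apostrophe) is a simplification that will not necessarily match qdap's word handling.

```r
# Hypothetical toy data in the shape described in the question.
df <- data.frame(
  id     = 1:3,
  texts  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "What should we do?"),
  label1 = c("a", "b", "a"),
  label2 = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# Assumed tokenizer: lowercase, split on runs of characters that are
# neither a-z nor an apostrophe, then drop empty strings.
tokens <- strsplit(tolower(df$texts), "[^a-z']+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

# Vocabulary across all rows.
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every text: one row per text, one column per word.
freqs <- t(vapply(tokens,
                  function(w) as.integer(table(factor(w, levels = vocab))),
                  integer(length(vocab))))
colnames(freqs) <- vocab

# Keep only the n most frequent words overall.
n   <- 5
top <- names(sort(colSums(freqs), decreasing = TRUE))[seq_len(min(n, length(vocab)))]

# Bind the counts to the original columns; check.names = FALSE keeps "it's" intact.
result <- data.frame(df, freqs[, top, drop = FALSE], check.names = FALSE)
result
```

Counting against a fixed `vocab` via `factor(..., levels = vocab)` is what guarantees a 0 for words that do not occur in a given text, so every row gets the same set of columns.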
Tyler Rinker
  • Hi Tyler, I have an additional question. Perhaps you can give me a clue about how to deal with it. Would qdap be able to help me make variables that represent bigrams (with their frequencies as data) instead of only the frequencies of single words? – rdatasculptor May 18 '13 at 12:57
  • Yes but ask as a separate question please. – Tyler Rinker May 18 '13 at 13:47
  • done: http://stackoverflow.com/questions/16626168/r-text-mining-how-to-change-texts-in-r-data-frame-column-into-several-columns – rdatasculptor May 18 '13 at 15:55