
I am doing some basic NLP work in R. I have two data sets and want to replace the words in one with the cluster value of each word from the other.

The first data set holds sentences and the second holds the cluster value for each word (assume that every word in the first data set has a cluster value):

original_text_df <- read.table(text="Text
'this is some text'
'this is more text'", header=TRUE, sep="", stringsAsFactors=FALSE)

cluster_df <- read.table(text="Word Cluster
this 2
is 2
some 3
text 4
more 3", header=TRUE, sep="", stringsAsFactors=FALSE)

This is the desired transformed output:

Text
"2 2 3 4"
"2 2 3 4"

Looking for an efficient solution as I have long sentences and many of them. Thanks!

amunategui
  • Aside from the MWE you created with arbitrary "cluster values", how does one calculate those values? That is, what package and function creates cluster values from string vectors? – lawyeR Mar 12 '15 at 01:02
  • Yes, I think we need a bit more information to answer this question... – Ben Mar 12 '15 at 01:07
  • not an R package - comes from Gensim in Python – amunategui Mar 12 '15 at 02:34

1 Answer


You could try something like this:

library(tidyr)
library(dplyr)
library(stringi)

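# Split each sentence into words, stack them into one long table tagged by
# sentence (group), then look up the cluster value for each distinct word (x)
df1 <- unnest(stri_split_fixed(original_text_df$Text, ' '), group) %>%
  group_by(x) %>% mutate(cluster = cluster_df$Cluster[cluster_df$Word %in% x]) 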

Which gives:

#Source: local data frame [8 x 3]
#Groups: x
#
#  group    x cluster
#1    X1 this       2
#2    X1   is       2
#3    X1 some       3
#4    X1 text       4 
#5    X2 this       2
#6    X2   is       2
#7    X2 more       3
#8    X2 text       4
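If those packages are not available, the same long table can probably be built in base R. This is only a minimal sketch, assuming the words are separated by single spaces and that Text is a character column (wrap it in as.character() if it was read in as a factor). The name df1_alt is just illustrative, and match() replaces the grouped %in% lookup used above.

```r
# Example data, same values as in the question
original_text_df <- data.frame(
  Text = c("this is some text", "this is more text"),
  stringsAsFactors = FALSE
)
cluster_df <- data.frame(
  Word = c("this", "is", "some", "text", "more"),
  Cluster = c(2, 2, 3, 4, 3),
  stringsAsFactors = FALSE
)

# One character vector of words per sentence (lengths may differ)
words <- strsplit(original_text_df$Text, " ", fixed = TRUE)

# Long table: one row per word, tagged with its sentence index
df1_alt <- data.frame(
  group = rep(seq_along(words), lengths(words)),
  x = unlist(words),
  stringsAsFactors = FALSE
)

# Vectorised lookup: position of each word in cluster_df$Word
df1_alt$cluster <- cluster_df$Cluster[match(df1_alt$x, cluster_df$Word)]
```

df1_alt has the same group, x and cluster columns as df1 (with group as an integer index rather than X1, X2), so the split() step applies to it as well. A word missing from cluster_df would get NA here.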

From there, to match your expected output, you could build a list of clusters for each group (sentence) using split() and reconstruct a data frame:

l <- split(df1$cluster, f = df1$group)
df2 <- data.frame(Text = do.call(rbind, lapply(l, paste0, collapse = " ")))

And you will get:

#      Text
#X1 2 2 3 4
#X2 2 2 3 4
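If you only need the final strings, a shorter route may be to skip the long table altogether. The sketch below is an assumption-laden alternative, not a benchmarked solution: it uses a named vector as the lookup and works sentence by sentence, so sentences with different numbers of words are handled naturally. The names lookup and replace_with_clusters are hypothetical helpers introduced for illustration.

```r
# Example data, same values as in the question
original_text_df <- data.frame(
  Text = c("this is some text", "this is more text"),
  stringsAsFactors = FALSE
)
cluster_df <- data.frame(
  Word = c("this", "is", "some", "text", "more"),
  Cluster = c(2, 2, 3, 4, 3),
  stringsAsFactors = FALSE
)

# Named vector: names are the words, values are the cluster ids
lookup <- setNames(cluster_df$Cluster, cluster_df$Word)

# Hypothetical helper: replace every word of every sentence by its cluster
replace_with_clusters <- function(sentences, lookup) {
  words <- strsplit(sentences, " ", fixed = TRUE)
  # Index the lookup by name, then glue the values back into one string
  vapply(words, function(w) paste(lookup[w], collapse = " "), character(1))
}

result_df <- data.frame(
  Text = replace_with_clusters(original_text_df$Text, lookup),
  stringsAsFactors = FALSE
)
```

Words that are not in cluster_df would show up as "NA" in the output, so it is worth checking for them first if that assumption from the question does not hold.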

You can also refer to this very similar question I asked a few months ago, which has lots of other suggestions.

Steven Beaupré
  • Thanks for your solution and the link - my problem is convoluted due to string splits on both ends to handle an irregular number of words. – amunategui Mar 12 '15 at 02:36