Error in aggregate.data.frame(as.data.frame(x), ...) : arguments must have same length

Question

Hi, I'm working through the last example in this tutorial, "Topic proportions over time": https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html

I ran it on my data with this code:

library(readxl)
library(tm)
# Import text data

tweets <- read_xlsx("C:/R/data.xlsx")

textdata <- tweets$text

#Load in the library 'stringr' so we can use the str_replace_all function. 
library('stringr')

#Remove URL's 
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")


textdata <- gsub("@\\w+", " ", textdata)  # Remove user names (all proper names if you're wise!)

textdata <- iconv(textdata, to = "ASCII", sub = " ")  # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)

textdata <- gsub("http.+ |http.+$", " ", textdata)  # Remove links

textdata <- gsub("[[:punct:]]", " ", textdata)  # Remove punctuation


#Change all the text to lower case
textdata <- tolower(textdata)



#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))


textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)

# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata))  # Create corpus object


#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)

ui = unique(dtm$i)
dtm.new = dtm[ui,]

#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new   <- datatm[rowTotals> 0, ]

library("ldatuning")
library("topicmodels")

k <- 7

ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)


#####################################################
#topics by year

tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)

tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")

topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)


top5termsPerTopic <- terms(ldaTopics, 7)  # the fitted model is called ldaTopics in this script
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")

# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames  # k (lower case) is defined above


# reshape data frame
library(reshape2)  # provides melt()
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")

# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here is the Excel file with the input data: https://www.mediafire.com/file/4w2hkgzzzaaax88/data.xlsx/file

I get the error when I run the line with the aggregate function, and I can't figure out what is going wrong. I created the "decade" variable the same way as in the tutorial; when I print it, it looks fine, and the theta variable looks fine too. I changed the aggregate call several times, following for example the post "Error in aggregate.data.frame : arguments must have same length".

But I still get the same error. Please help.

It seems that the vector (factor) you are using to aggregate the matrix has a different length than the matrix itself. I'm not familiar with text mining or the objectives of your work, but it looks like you want to take the mean of theta by decade, and the decade vector is indeed larger: length(tweets$decade) is 3481, while nrow(theta) is 3214. -- Santiago Capobianco, Feb 27 '19 at 17:48
Yes, but how can that be if I have no control over those vectors? They are created by LDA, and the decade column matches the initial tweet vector. How could I change those dimensions? Besides, I followed the tutorial as it is. -- Ana, Feb 27 '19 at 19:25

score 3 · Accepted Answer · answered Mar 07 '19 at 19:03

I am not sure what you want to achieve with the command

topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)

As far as I see you produce only one decade with

tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
table(tweets$decade)

2010 
3481
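
For reference, here is a small sketch of how that decade label is derived. The date strings are made up for illustration; the real date2 column may be stored differently. Note that substr() treats a start position of 0 like 1, so the call keeps the first three characters of each date.

```r
# Hypothetical date strings; the real `date2` column may look different.
date2 <- c("2017-03-01", "2018-11-20", "2009-06-15", "1998-01-02")

# Keep the first three characters of the year and append "0".
decade <- paste0(substr(date2, 1, 3), "0")

# One count per decade; with dates only from the 2010s there is a single group.
table(decade)
```

If every date falls in the 2010s, table() shows only "2010", which is what happens with the posted data.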

With all the preprocessing from tweets to textdata you're producing a few empty lines. This is where your problem starts. Textdata with its new empty lines is the basis of your corpus and your dtm. You get rid of them with the lines:

ui = unique(dtm$i)
dtm.new = dtm[ui,]

At the same time, you are deleting the empty rows (documents) from the dtm, thereby changing the number of rows of your object. This new dtm without the empty documents is then the basis for the topic model. This comes back to haunt you when you call aggregate() with two objects of different lengths: tweets$decade, which still has the old length of 3481, and theta, which is produced by the topic model, which in turn is based on dtm.new -- remember, the one with fewer rows.
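
The mismatch can be reproduced on a toy scale. This is only a sketch with made-up data: textdata, decade and theta here are small stand-ins, not the objects from the question, and the theta values are invented rather than produced by LDA.

```r
# Toy stand-ins: five cleaned documents, two of which ended up empty.
textdata <- c("topic models", "", "latent dirichlet", " ", "gibbs sampling")
decade   <- c("2000", "2000", "2010", "2010", "2010")

# Positions of the documents that still contain at least one non-space character.
keep <- which(nchar(trimws(textdata)) > 0)

# Stand-in for theta: one row per kept document, two topic columns.
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.4, 0.6),
                ncol = 2, byrow = TRUE,
                dimnames = list(as.character(keep), c("V1", "V2")))

length(decade)  # 5
nrow(theta)     # 3

# Using the unfiltered grouping vector reproduces the error from the question.
res <- try(aggregate(theta, by = list(decade = decade), mean), silent = TRUE)
inherits(res, "try-error")  # TRUE

# Filtering the grouping vector the same way makes the lengths agree.
out <- aggregate(theta, by = list(decade = decade[keep]), mean)
out
```

The point is that whatever filter is applied to the documents has to be applied to the grouping vector as well.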

What I would suggest is to first add an ID column to tweets. Later on, you can use the IDs to find out which texts were deleted by your preprocessing, and match the length of tweets$decade to that of theta.

I rewrote your code -- try this out:

library(readxl)
library(tm)
# Import text data

tweets <- read_xlsx("data.xlsx")

## Include ID for later
tweets$ID <- 1:nrow(tweets)

textdata <- tweets$text

#Load in the library 'stringr' so we can use the str_replace_all function. 
library('stringr')

#Remove URL's 
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")


textdata <- gsub("@\\w+", " ", textdata)  # Remove user names (all proper names if you're wise!)

textdata <- iconv(textdata, to = "ASCII", sub = " ")  # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)

textdata <- gsub("http.+ |http.+$", " ", textdata)  # Remove links

textdata <- gsub("[[:punct:]]", " ", textdata)  # Remove punctuation

#Change all the text to lower case
textdata <- tolower(textdata)

#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))

textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)

# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata))  # Create corpus object

#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]

#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new   <- datatm[rowTotals> 0, ]

library("ldatuning")
library("topicmodels")

k <- 7

ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)

#####################################################
#topics by year

tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)

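# Document names kept in dtm.new (with VectorSource these should be the original row positions)
id <- data.frame(ID = dtm.new$dimnames$Docs)
colnames(id) <- "ID"
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")

# Keep only the tweets whose documents survived the preprocessing
tweets_new <- merge(id, tweets, by.x="ID", by.y = "ID", all.x = TRUE)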

topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets_new$decade), mean)
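
One caveat, offered as a sketch rather than a definitive fix: merge() sorts its result on the key column by default, and the document names are character strings, so the merged rows may not come back in the same order as the rows of theta. With a single decade that does not matter, but with several it could. An alternative is to index tweets directly by the row names of theta. The example below uses made-up stand-ins for tweets and theta, and assumes the row names of theta are the original row positions stored as strings.

```r
# Toy stand-ins for `tweets` and `theta`; in the real script theta comes from posterior().
tweets <- data.frame(ID = 1:12, decade = rep(c("2000", "2010"), each = 6))

# Suppose documents 2 and 11 were empty and dropped before fitting the model.
kept  <- c(1, 3:10, 12)
theta <- matrix(1 / 2, nrow = length(kept), ncol = 2,
                dimnames = list(as.character(kept), c("V1", "V2")))

# Index tweets by the document names so row i of tweets_new matches row i of theta.
tweets_new <- tweets[as.integer(rownames(theta)), ]

# Guard against silent misalignment before aggregating.
stopifnot(nrow(tweets_new) == nrow(theta),
          identical(tweets_new$ID, as.integer(rownames(theta))))

topic_proportion_per_decade <- aggregate(theta,
                                         by = list(decade = tweets_new$decade),
                                         mean)
topic_proportion_per_decade
```

The stopifnot() check is cheap and fails loudly if the two objects ever drift apart again.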

