I have a large data frame of authors and corresponding texts (approximately 450,000 records). From the data frame I extracted two vectors, one for authors and one for texts, for example:
author <- c("Sallust",
"Tacitus",
"Justin",
"Cato the Elder",
"Claudius",
"Quintus Fabius Pictor",
"Justin",
"Claudius",
"Cato the Elder",
"Tacitus",
"Sallust")
text <- c("Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet")
My goal is to split the data set into chunks small enough to be text mined, while still keeping all the records with the same author in the same chunk.
I noticed that extracting the vectors author and text from the original data frame is fast BUT combining the extracted vectors in a new data frame is extremely slow. So I guess I should avoid creating the data frame with all the records.
Probably the "smart" solution would be:
- Order the vector author alphabetically (to make sure records with the same author are contiguous);
- Order the vector text based on the ordering of the vector author;
- Create a logical vector (TRUE/FALSE) indicating whether the author is the same as the author of the previous record;
- Create a vector splitAt containing the indexes of the vectors author and text where to split;
- Split the vectors.
In code, assuming my procedure makes sense, I got the first 3 steps working:
# Sort vectors
order <- order(author)
author <- author[order]
text <- text[order]
same_author <- duplicated(author)
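To make these three steps concrete, here is a small self-contained check on a shortened version of the sample data. It is only a sketch: the names idx and same_as_prev are mine, and the short texts are placeholders.

```r
# Shortened, made-up sample (placeholder texts)
author <- c("Sallust", "Tacitus", "Justin", "Tacitus", "Sallust")
text <- c("text 1", "text 2", "text 3", "text 4", "text 5")

# One permutation reorders both vectors, so author/text pairs stay aligned
idx <- order(author)
author <- author[idx]
text <- text[idx]

# On a sorted vector, duplicated() marks every record whose author
# equals the author of the previous record
same_author <- duplicated(author)

# Equivalent explicit comparison with the previous element
same_as_prev <- c(FALSE, author[-1] == author[-length(author)])
```

Using a separate name such as idx also avoids shadowing the base function order().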
But I don't know how to proceed further. It should probably be something like:
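# Index for splitting
max_length <- 2
num_records <- length(author)
# Number of candidate split positions after the first chunk
num_chunks <- (num_records - 1) %/% max_length
# Initialise vector with the start of the first chunk
splitAt <- 1
for (n in seq_len(num_chunks)) {
  index <- n * max_length + 1
  # Move forward until the record belongs to a new author
  while (index <= num_records && same_author[index]) {
    index <- index + 1
  }
  # Only store the index if it is still inside the vector
  if (index <= num_records) {
    splitAt <- append(splitAt, index)
  }
}
# A long run of one author can yield the same index more than once
splitAt <- unique(splitAt)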
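For the last step (splitting the vectors), a minimal sketch could look like the following. It assumes that splitAt holds the start index of every chunk, beginning with 1, and that both vectors are already sorted by author. The values in splitAt and the names chunk_id, author_chunks and text_chunks are made up for illustration.

```r
# Sorted example data (same authors as above)
author <- c("Cato the Elder", "Cato the Elder", "Claudius", "Claudius",
            "Justin", "Justin", "Quintus Fabius Pictor",
            "Sallust", "Sallust", "Tacitus", "Tacitus")
text <- rep("Lorem ipsum dolor sit amet", length(author))

# Assumed: start index of every chunk, first element is 1
splitAt <- c(1, 3, 5, 7, 10)

# findInterval() maps each record position to the chunk it falls in
chunk_id <- findInterval(seq_along(author), splitAt)

# split() returns a list with one element per chunk
author_chunks <- split(author, chunk_id)
text_chunks <- split(text, chunk_id)
```

Because both vectors are split by the same chunk_id, the author and text chunks stay aligned, and no intermediate data frame is needed.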