
I have a large data frame of authors and corresponding texts (approximately 450,000 records). From the data frame I extracted two vectors, one for the authors and one for the texts, for example:

author <- c("Sallust",
            "Tacitus",
            "Justin",
            "Cato the Elder",
            "Claudius",
            "Quintus Fabius Pictor",
            "Justin",
            "Claudius",
            "Cato the Elder",
            "Tacitus",
            "Sallust")
text <- c("Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet",
          "Lorem ipsum dolor sit amet")

My goal is to subset the data set into chunks small enough to be text mined while still keeping all records with the same author in the same chunk.

I noticed that extracting the vectors author and text from the original data frame is fast BUT combining the extracted vectors into a new data frame is extremely slow. So I guess I should avoid creating a data frame with all the records.

Probably the "smart" solution would be:

  1. Order the vector author alphabetically (so as to make sure records with the same author are contiguous);
  2. Order the vector text based on the ordering of the vector author;
  3. Create a logical vector (TRUE/FALSE) indicating whether the author is the same as the previous record's author;
  4. Create a vector splitAt containing the indexes of the vectors author and text at which to split;
  5. Split the vectors.

In code, assuming my procedure makes sense, I got the first 3 steps working:

# Sort vectors
order <- order(author)
author <- author[order]
text <- text[order]

same_author <- duplicated(author)
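
For the sample vectors above this gives, for instance:

head(author)
# [1] "Cato the Elder" "Cato the Elder" "Claudius" "Claudius" "Justin" "Justin"
head(same_author)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE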

But I don't know how to proceed further. It should probably be something like:

# Index for splitting
max_length <- 2
num_records <- length(author)
num_chunks <- as.integer((num_records %/% max_length)) - 1

# Initialise vector (not sure whether it should start at 2 to mark the first index where to split)
splitAt <- 1

for (n in 1:num_chunks){
  index <- n * max_length + 1
  while (same_author[index]!=FALSE) {
      splitAt <- append(splitAt, index)
      index <- index + 1 
  }
}
  • I do not understand why you extract vectors from your data frame. Why don't you just process the df by author with plyr::ddply()? – Karl Forner Jan 09 '14 at 09:17
  • The whole data set is actually a merge of two df (both with the attributes `author` and `text` but also with non-shared attributes, which made them 'un-bindable'). I extracted the two vectors of interest from both df, then combined them. As a last step I tried to put the vectors back together in a df, but since they were of length 450,000 the process was terribly slow. I then thought to try a different approach... – CptNemo Jan 09 '14 at 09:27
  • And just getting both the `author` and the `text` from each of the `df` and then `rbind`? I think it is also slow, but maybe not as slow as creating the data frame again... – llrs Jan 09 '14 at 09:39
  • If you find a `data.frame` slow then use a `data.table` with an appropriate key and watch it fly... – Matt Weller Jan 09 '14 at 09:47
  • @Llopis Yes, my idea was exactly not to recreate the data frame but to work on the vectors first, split them, then create multiple small df. – CptNemo Jan 09 '14 at 09:51
  • And why do you need to split it? By the same reasoning you could get a `dataframe` for each author and just join the ones where the author comes from the original 2 `df`? – llrs Jan 09 '14 at 10:05

1 Answer


I found this solution (the key algorithm is from here).

# Sort vectors
order <- order(author)
author <- author[order]
text <- text[order]

same_author <- duplicated(author)

# Index for splitting
len_chunks <- 2
num_records <- length(author)
num_chunks <- as.integer((num_records %/% len_chunks)) - 1

# Initialise vector
splitAt_index <- numeric()

index <- len_chunks
for (n in 1:num_chunks){
  # Advance the split point past records that repeat the previous author,
  # so all records by the same author end up in the same chunk
  # (isTRUE() also stops the loop cleanly if index runs past the end of the vector)
  while (isTRUE(same_author[index])) {
    index <- index + 1
  }
  splitAt_index <- append(splitAt_index, index)
  index <- index + len_chunks
}

# Function to split vector based on position indexes from https://stackoverflow.com/a/16358095/1707938
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))

author_list <- splitAt(author, splitAt_index)
text_list <- splitAt(text, splitAt_index)

# Bind each author/text chunk into a two-column matrix named corpus_1, corpus_2, ...
for (i in 1:length(author_list)) {
  m <- cbind(author_list[[i]], text_list[[i]])
  assign(paste("corpus_", i, sep = ""), m)
}
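
To see what the splitAt() helper does, here is a tiny example (each position in pos starts a new chunk; output shown as comments):

splitAt(letters[1:6], c(3, 5))
# [[1]]
# [1] "a" "b"
#
# [[2]]
# [1] "c" "d"
#
# [[3]]
# [1] "e" "f"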

It seems quite fast. On a MacBook Pro (2.4 GHz, 4 GB) with 5 character vectors of length 448,634:

   user  system elapsed 
 13.248   0.174  13.662 
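
As a side note, the chunks could also be kept in a single list instead of separate corpus_1, corpus_2, ... objects in the global environment; a minimal sketch of that variant (not what the timing above measured):

corpus_list <- Map(cbind, author_list, text_list)
# corpus_list[[i]] holds the same two-column matrix as corpus_i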