EDIT: I finally got my for() loop to work by dropping the "in seq_along()" in favor of the more familiar "1:(nrow(df))". Also (crucially), I made it more efficient by adding a break statement to the if() body:

for (i in 1:(nrow(urwiki))) {
  for (j in 1:(nrow(unique_names))) {
    if (identical(unique_names[[j, "editor"]], urwiki[[i, "editor"]])) {
      cohort_vector[[i]] <- unique_names[[j, "cohort"]]
      break
    }
  }
}

HOWEVER, it still took over an hour (760,000 rows * 11,000 possible matches = roughly 8 billion comparisons in the worst case), so if anyone can tell me how to "vectorize" this operation in the future, I would appreciate it.
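For reference, a minimal sketch of one way the lookup might be vectorized with base R's match(). The toy data below is hypothetical: the editor names come from the sample further down, but the cohort values are invented so that the result is visible, and it assumes each editor appears only once in unique_names.

```r
# Toy stand-ins for the real tibbles (cohort values are made up for illustration)
urwiki <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  stringsAsFactors = FALSE
)
unique_names <- data.frame(
  editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry"),
  cohort = c("2002", "2003", "2004", "2004", "2001"),
  stringsAsFactors = FALSE
)

# match() returns, for every row of urwiki, the position of its editor
# in unique_names$editor (NA if the editor is not found)
idx <- match(urwiki$editor, unique_names$editor)

# Index the cohort column with those positions: one cohort per urwiki row
cohort_vector <- unique_names$cohort[idx]

# Attach the result to the original data
urwiki$cohort <- cohort_vector
```

This replaces both loops with a single lookup, so repeated editors such as the second "Steve" are handled the same way as the first.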

Original question follows...

I want to create a vector of classifications based on a dataframe/tibble. The "levels" in question are strings, so perhaps there is a better way to do this, but I had no luck with a for() loop, and I have read that vectorized methods are usually preferable in R. I tried the lapply() approach found here:

Applying the same factor levels to multiple variables in an R data frame

urwiki["editor"] <- lapply(urwiki["editor"], factor, 
           levels = unique_names$editor, 
           labels = unique_names$cohort)

The error I receive reports that the attempted labels vector is one value too long:

Error in FUN(X[[i]], ...) : 
invalid 'labels'; length 12863 should be 1 or 12862

Both the levels and labels inputs are from the same dataframe, which has a height of 12863, so why does it want a vector length of one less?
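One possible explanation, offered as an assumption because the full data is not shown: factor() has a default argument exclude = NA, which drops any NA from levels before the length of labels is checked. If unique_names$editor contains a single NA, levels shrinks to 12862 while labels stays at 12863. The sketch below reproduces that pattern on made-up values.

```r
# Hypothetical levels and labels; the NA stands in for a missing editor name
lv <- c("Jim", "Steve", NA)
lb <- c("2003", "2004", "2005")

# With the default exclude = NA, the NA level is removed first,
# so 3 labels no longer match the 2 remaining levels and factor() errors
res <- tryCatch(
  factor("Jim", levels = lv, labels = lb),
  error = function(e) conditionMessage(e)
)

# Dropping the NA row from both vectors keeps them the same length
keep <- !is.na(lv)
ok <- factor("Jim", levels = lv[keep], labels = lb[keep])
```

Checking sum(is.na(unique_names$editor)) on the real data would confirm or rule this out.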

I also tried this in the purrr package:

cohort_vector <- map_int(urwiki$editor, factor,
                     levels = unique_names$editor, 
                     labels = unique_names$cohort)

with the corresponding error:

Error in .f(.x[[i]], ...) : 
invalid 'labels'; length 12863 should be 1 or 12862

The tibble:

urwiki <- structure(list(articleid = c("4", "4", "4", "4", "4", "4"), 
date_time = c("1/27/2004 17:36", 
"2/20/2004 13:40", "3/3/2004 18:31", "3/3/2004 18:47", "3/3/2004 18:55", 
"3/3/2004 19:01"), editor = c("Steve", "Jim", 
"Terry", "Steve", "Rachel", "Harvey"
), year = c("2004", "2004", "2004", "2004", "2004", "2004")), .Names = 
c("articleid", 
"date_time", "editor", "year"), row.names = c(NA, -6L), class = 
c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "year", drop = TRUE, indices = list(
    0:5), group_sizes = 6L, biggest_group_size = 6L, labels = 
structure(list(
    year = "2004"), row.names = c(NA, -1L), class = "data.frame", vars = 
"year", drop = TRUE, .Names = "year"))

The tibble looks like this:

 anon articleid       date_time deleted          editor
 <lgl>     <int>           <chr>   <lgl>           <chr>
 TRUE         4 1/27/2004 17:36   FALSE           Steve
 TRUE         4 2/20/2004 13:40   FALSE             Jim
 TRUE         4  3/3/2004 18:31   FALSE           Terry
 TRUE         4  3/3/2004 18:47   FALSE           Steve
 TRUE         4  3/3/2004 18:55   FALSE          Rachel

I have made a separate tibble that identifies each unique editor and the year that they first appear:

unique_names <- structure(list(cohort = c("2004", "2004", "2004", "2004", 
"2004", "2004"), editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry", 
"139.164.251.34"), n = c(65L, 2L, 1L, 1L, 1L, 9L)), 
.Names = c("cohort", "editor", 
 "n"), row.names = c(NA, -6L), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), vars = c("cohort", "editor"), drop = TRUE, indices = 
list(
    0L, 1L, 2L, 3L, 4L, 5L), group_sizes = c(1L, 1L, 1L, 1L, 
1L, 1L), biggest_group_size = 1L, labels = structure(list(cohort = c("2004", 

"2004", "2004", "2004", "2004", "2004"), editor = c("Jim", 
"Steve", "Harvey", 
"Rachel", "Terry", "139.164.251.34")), row.names = c(NA, 
-6L), class = "data.frame", vars = c("cohort", "editor"), drop = TRUE, 
.Names = c("cohort", 
"editor")))

It looks like:

cohort                                           editor
<chr>                                            <chr>
2004                                            Jim
2004                                            Steve
2004                                            Harvey

So I am trying to make a vector the same length as the original set that identifies each editor by its cohort. I can then add that vector to the original tibble to associate each row with the cohort of its editor, rather than just the year in which the row was created. In this example, the result would simply be a vector of 6 "2004"s.
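A minimal sketch of the same goal expressed as a join with base R's merge(), which adds the cohort column directly instead of building a separate vector. The data is a hypothetical stand-in (cohort values are invented), and row_id is a helper column introduced here only to restore the original row order, since merge() does not guarantee it.

```r
# Toy stand-ins for the real tibbles (cohort values are made up for illustration)
urwiki <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  year = "2004",
  stringsAsFactors = FALSE
)
unique_names <- data.frame(
  editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry"),
  cohort = c("2002", "2003", "2004", "2004", "2001"),
  stringsAsFactors = FALSE
)

# Remember the original row order (row_id is a hypothetical helper column)
urwiki$row_id <- seq_len(nrow(urwiki))

# Left join: keep every urwiki row and pull in the matching cohort
joined <- merge(urwiki, unique_names[, c("editor", "cohort")],
                by = "editor", all.x = TRUE, sort = FALSE)

# Put the rows back in their original order
joined <- joined[order(joined$row_id), ]
```

With dplyr loaded, left_join(urwiki, unique_names, by = "editor") should give an equivalent result while keeping the row order of urwiki.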

When I run the map_int function on the above head() data, it does not give me the error, but also doesn't return the vector I need.

The for() loop I mentioned before looked something like this:

cohort_vector <- vector("integer", nrow(urwiki2))
for (i in seq_along(urwiki2)){
  for(j in seq_along(unique_names)){
    if(identical(unique_names[[j, "editor"]], urwiki2[[i,"editor"]]) ){
      cohort_vector[[i]] <- unique_names[[j, "cohort"]]
    }
  }

}

This for loop works on the sample data above, but does not handle repeated names, i.e., the second "Steve" is not matched and its value stays 0. When I run it on my actual data set (700,000+ rows), I just end up with a vector of 700,000 zeroes.
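A short sketch of why the loop above may visit so few rows: on a data frame, seq_along() counts columns, not rows, so the loop indices only run up to the number of columns. The data frame below is a hypothetical two-column example.

```r
# Hypothetical data frame with 6 rows and 2 columns
df <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  year = "2004",
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so seq_along() indexes the columns
col_index <- seq_along(df)

# Row indices have to be built from nrow(); seq_len() is also safe for 0 rows
row_index <- seq_len(nrow(df))
```

Here col_index has length 2 and row_index has length 6, so a loop over seq_along(df) would never reach rows 3 to 6.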

Kevin Mc
  • Can you provide an [example data frame](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example-aka-mcve-minimal-complete-and-ver) along with expected output? – Luke C Nov 24 '17 at 23:47
  • I think that should be reproducible, I changed the editors to generic names instead of ISPs – Kevin Mc Nov 25 '17 at 01:23
  • What is *uniquenames*? Please test your sample data and code can reproduce your issue from our empty R environments. Also desired results may help as textual explanation is not too clear. – Parfait Nov 25 '17 at 05:21
  • Why not just merge or join the two tibbles to get editor's *cohort* in original one? – Parfait Nov 25 '17 at 15:13
  • the unique_names tibble is constructed by finding all of the unique editor names and associating them with the first year they appear. It is only about 12000 rows, whereas the larger set is about 760,000 rows. I am trying to apply a label to each editor in the larger set, based on its cohort in the smaller set. – Kevin Mc Nov 25 '17 at 16:29
