EDIT: I finally got my for() loop to work by dropping "in seq_along()" in favor of the more familiar "1:(nrow(df))". Also (crucially), I made it more efficient by adding a break statement to the if() body:
for (i in 1:(nrow(urwiki))) {
  for (j in 1:(nrow(unique_names))) {
    if (identical(unique_names[[j, "editor"]], urwiki[[i, "editor"]])) {
      cohort_vector[[i]] <- unique_names[[j, "cohort"]]
      break
    }
  }
}
HOWEVER, it still took over an hour (760,000 rows * 11,000 possible matches = roughly 8.4 billion comparisons in the worst case), so if anyone can tell me how to "vectorize" this operation in the future, I would appreciate it.
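One common way to vectorize this kind of lookup in base R is match(), which returns, for each element of its first argument, the position of its first match in the second. A minimal sketch on toy data follows; the cohort years are made up for illustration, and it assumes each editor appears only once in unique_names:

```r
# Toy stand-ins for the real tables (cohort values are invented for illustration)
urwiki <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  stringsAsFactors = FALSE
)
unique_names <- data.frame(
  cohort = c("2004", "2005", "2006", "2004", "2005"),
  editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry"),
  stringsAsFactors = FALSE
)

# match() finds the row of unique_names for every row of urwiki in one call;
# editors with no match yield NA rather than 0
idx <- match(urwiki$editor, unique_names$editor)
cohort_vector <- unique_names$cohort[idx]

# Attach the result as a new column
urwiki$cohort <- cohort_vector
```

This replaces the nested loop with a single lookup, so it should scale far better than comparing every row against every name. With dplyr loaded, a left_join() on "editor" would likely give the same column.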
Original question follows...
I want to create a vector of classifications based on a dataframe/tibble.
The "levels" in question are strings, so perhaps there is a better way to do this, but I had no luck with a for() loop, and I have read that vectorized methods are usually preferable in R.
I have tried using the lapply() method found here:
Applying the same factor levels to multiple variables in an R data frame
urwiki["editor"] <- lapply(urwiki["editor"], factor,
                           levels = unique_names$editor,
                           labels = unique_names$cohort)
The error I receive reports that the attempted labels vector is one value too long:
Error in FUN(X[[i]], ...) :
invalid 'labels'; length 12863 should be 1 or 12862
Both the levels and labels inputs are from the same dataframe, which has a height of 12863, so why does it want a vector length of one less?
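One possible explanation (an assumption, since the full data is not shown here): factor() has a default argument exclude = NA, which drops NA from levels before the lengths of levels and labels are compared. If exactly one entry of unique_names$editor were NA, levels would shrink to 12862 while labels stayed at 12863. A small sketch with made-up values reproduces the same kind of error:

```r
# Made-up lookup table in which one editor name is missing
editors <- c("Jim", "Steve", NA)
cohorts <- c("2004", "2005", "2006")

# factor() removes NA from `levels` (exclude = NA by default), so two levels
# remain while three labels are supplied; capture the resulting error message
msg <- tryCatch(
  factor("Jim", levels = editors, labels = cohorts),
  error = function(e) conditionMessage(e)
)
print(msg)

# Count the NA entries in the lookup column
n_missing <- sum(is.na(editors))
print(n_missing)
```

Running sum(is.na(unique_names$editor)) on the real table would show whether this applies.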
I also tried this in the purrr package:
cohort_vector <- map_int(urwiki$editor, factor,
                         levels = unique_names$editor,
                         labels = unique_names$cohort)
with the corresponding error:
Error in .f(.x[[i]], ...) :
invalid 'labels'; length 12863 should be 1 or 12862
The tibble:
urwiki <- structure(list(articleid = c("4", "4", "4", "4", "4", "4"),
date_time = c("1/27/2004 17:36",
"2/20/2004 13:40", "3/3/2004 18:31", "3/3/2004 18:47", "3/3/2004 18:55",
"3/3/2004 19:01"), editor = c("Steve", "Jim",
"Terry", "Steve", "Rachel", "Harvey"
), year = c("2004", "2004", "2004", "2004", "2004", "2004")), .Names =
c("articleid",
"date_time", "editor", "year"), row.names = c(NA, -6L), class =
c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "year", drop = TRUE, indices = list(
0:5), group_sizes = 6L, biggest_group_size = 6L, labels =
structure(list(
year = "2004"), row.names = c(NA, -1L), class = "data.frame", vars =
"year", drop = TRUE, .Names = "year"))
The tibble looks like this:
anon articleid date_time deleted editor
<lgl> <int> <chr> <lgl> <chr>
TRUE 4 1/27/2004 17:36 FALSE Steve
TRUE 4 2/20/2004 13:40 FALSE Jim
TRUE 4 3/3/2004 18:31 FALSE Terry
TRUE 4 3/3/2004 18:47 FALSE Steve
TRUE 4 3/3/2004 18:55 FALSE Rachel
I have made a separate tibble that identifies each unique editor and the year that they first appear:
unique_names <- structure(list(cohort = c("2004", "2004", "2004", "2004",
"2004", "2004"), editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry",
"139.164.251.34"), n = c(65L, 2L, 1L, 1L, 1L, 9L)),
.Names = c("cohort", "editor",
"n"), row.names = c(NA, -6L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("cohort", "editor"), drop = TRUE, indices =
list(
0L, 1L, 2L, 3L, 4L, 5L), group_sizes = c(1L, 1L, 1L, 1L,
1L, 1L), biggest_group_size = 1L, labels = structure(list(cohort = c("2004",
"2004", "2004", "2004", "2004", "2004"), editor = c("Jim",
"Steve", "Harvey",
"Rachel", "Terry", "139.164.251.34")), row.names = c(NA,
-6L), class = "data.frame", vars = c("cohort", "editor"), drop = TRUE,
.Names = c("cohort",
"editor")))
It looks like this:
cohort editor
<chr> <chr>
2004 Jim
2004 Steve
2004 Harvey
So I am trying to make a vector the same length as the original data set that identifies each editor by cohort. I can then add that vector to the original tibble to associate each row with the cohort of its editor, rather than just the year in which that row was created. In this example, the result would simply be a vector of six "2004"s.
When I run the map_int function on the above head() data, it does not give me the error, but also doesn't return the vector I need.
The for() loop I mentioned before looked something like this:
cohort_vector <- vector("integer", nrow(urwiki2))
for (i in seq_along(urwiki2)) {
  for (j in seq_along(unique_names)) {
    if (identical(unique_names[[j, "editor"]], urwiki2[[i, "editor"]])) {
      cohort_vector[[i]] <- unique_names[[j, "cohort"]]
    }
  }
}
This for loop works on the sample data above, but it does not handle repeated names, i.e., the second "Steve" is not matched and its value stays 0. However, when I run it on my actual data set (700,000+ rows), I just end up with a vector of 700,000 zeroes.
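One detail that may explain part of this behavior: on a data frame, seq_along() counts columns rather than rows, because a data frame is a list of columns. A short sketch on a made-up data frame shows the difference:

```r
# Hypothetical 6-row, 2-column data frame
df <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  year = rep("2004", 6),
  stringsAsFactors = FALSE
)

col_idx <- seq_along(df)      # indexes the columns: 1 2
row_idx <- seq_len(nrow(df))  # indexes the rows: 1 2 3 4 5 6

print(col_idx)
print(row_idx)
```

So a loop over seq_along(df) only ever visits as many rows as the data frame has columns. seq_len(nrow(df)) is also safer than 1:nrow(df) for an empty data frame, since 1:0 counts down to c(1, 0).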