Sum the total number of strings separated by comma

Question

structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                         NA_character_, NA_character_),
              Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014",
                        "2007", "2011, 2011, 2011, 2012, 2012, 2012",
                        "2006, 2006, 2012, 2012, 2015")),
         .Names = c("Other", "Years"), row.names = 1:4, class = "data.frame")

Given the above data frame, the second column holds years separated by commas. I'd like to create a new column containing the number of years in each element of that column, so that the final data frame looks like this:

structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                         NA_character_, NA_character_),
               Years = c("2005, 2005, 2006, 2006, 2007","2011, 2014", "2007",
                         "2011, 2011, 2011, 2012, 2012, 2012",
                         "2006, 2006, 2012, 2012, 2015"), 
               yearlength = c(5, 2, 1, 6, 5)),
         .Names = c("Other", "Years", "yearlength"), row.names = 1:4, class = "data.frame")

I've tried solutions such as stack$yearlength <- count.fields(textConnection(stack), sep = ","), but I can't quite get it to work.

missuse · Accepted Answer · 2018-06-29T10:09:12.370

One approach is to count the commas and add 1:

df$yearlength <- stringr::str_count(df$Years, ",")+1
df
#output
  Other                              Years yearlength
1  <NA>       2005, 2005, 2006, 2006, 2007          5
2  <NA>                         2011, 2014          2
3  <NA>                               2007          1
4  <NA> 2011, 2011, 2011, 2012, 2012, 2012          6
5  <NA>       2006, 2006, 2012, 2012, 2015          5
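If adding a package dependency is not desirable, the same comma-counting idea can be sketched in base R. This is only an illustration: the `years` vector below is a made-up stand-in for `df$Years`, and the count is derived from how many characters disappear when the commas are removed.

```r
# Hypothetical example vector standing in for df$Years
years <- c("2005, 2005, 2006, 2006, 2007", "2011, 2014", "2007", NA)

# Number of commas = characters lost when every comma is deleted
n_commas <- nchar(years) - nchar(gsub(",", "", years, fixed = TRUE))

# One more item than there are commas; NA input stays NA
yearlength <- n_commas + 1L
yearlength
```

As with the `stringr` version, a missing value stays `NA` rather than being counted.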

Another would be to count the runs of digits:

df$yearlength <- stringr::str_count(df$Years, "\\d+")

A third option (thanks to Sotos's comment) would be to count the words:

stringi::stri_count_words(df$Years)

or

stringr::str_count(df$Years, "\\w+")

A fourth option is to count the runs of non-whitespace characters:

stringr::str_count(df$Years, "\\S+")

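All of the above give the same result on this data:

counts <- list(stringr::str_count(df$Years, ",") + 1,
               stringr::str_count(df$Years, "\\d+"),
               stringi::stri_count_words(df$Years),
               stringr::str_count(df$Years, "\\w+"),
               stringr::str_count(df$Years, "\\S+"))
# all.equal() compares only its first two arguments, so check each result against the first
all(sapply(counts[-1], function(x) isTRUE(all.equal(counts[[1]], x))))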
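The patterns agree on the question's data, but they are not interchangeable in general. The sketch below uses base R only, with a hypothetical helper `count_matches()` and a made-up vector `x`, to show how counting commas and counting digit runs can diverge on an empty string or a trailing comma.

```r
# Hypothetical helper: number of regex matches per element, NA for NA input
count_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr() reports -1 when there is no match, so count positive positions
  out <- vapply(m, function(pos) sum(pos > 0L), integer(1))
  out[is.na(x)] <- NA_integer_
  out
}

# Made-up edge cases: normal entry, empty string, trailing comma, missing value
x <- c("2005, 2006", "", "2007,", NA)

by_commas <- count_matches(x, ",") + 1L   # commas plus one
by_digits <- count_matches(x, "\\d+")     # runs of digits

data.frame(x, by_commas, by_digits)
```

Here the empty string counts as one item by commas but zero by digit runs, and the trailing comma inflates the comma-based count, so which pattern fits depends on how clean the column is.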

EDIT: when NA is present in the data set:

df[3,2] <- NA

All of the above solutions then return 5 2 NA 6 5.

To change NA to 0:

df$yearlength[is.na(df$yearlength)] <- 0
#output
  Other                              Years yearlength
1  <NA>       2005, 2005, 2006, 2006, 2007          5
2  <NA>                         2011, 2014          2
3  <NA>                               <NA>          0
4  <NA> 2011, 2011, 2011, 2012, 2012, 2012          6
5  <NA>       2006, 2006, 2012, 2012, 2015          5

Data (since the data in the question is corrupt):

df <- structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                               NA_character_, NA_character_),
                     Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014",
                               "2007", "2011, 2011, 2011, 2012, 2012, 2012",
                               "2006, 2006, 2012, 2012, 2015")),
                .Names = c("Other", "Years"), row.names = 1:5, class = "data.frame")

You could also use `stringi` and do `stringi::stri_count_words` — Sotos, Jun 29 '18 at 09:40
Thanks for the answer. My problem arises when I try to apply it to NA values. It seems to count those as 1 rather than 0. — WoeIs, Jun 29 '18 at 09:59
None of the proposed solutions counts them, e.g. `stringr::str_count(df$Years, "\\w+")`; they return `NA` in their place. See the edit for how to replace `NA` with `0`. – missuse, Jun 29 '18 at 10:06

Roman Luštrik · Answer 2 · 2018-06-29T11:24:01.987

You can split each string on the commas and then take the length of each resulting vector.

> sapply(strsplit(xy$Years, ","), length)
[1] 5 2 1 6 5

Added to account for an NA (example from @missuse):

xy <- structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                               NA_character_, NA_character_),
                     Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014",
                               "2007", "2011, 2011, 2011, 2012, 2012, 2012",
                               "2006, 2006, 2012, 2012, 2015")),
                .Names = c("Other", "Years"), row.names = 1:5, class = "data.frame")

xy[3, 2] <- NA

sapply(strsplit(xy$Years, ","), FUN = function(x) {
  length(na.omit(x))
})

[1] 5 2 0 6 5
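A vectorised variant of the same idea can be sketched with `lengths()`. The caveat is that `strsplit()` turns a missing string into a length-one `NA` piece, so missing entries need to be reset to 0 afterwards. The `years` vector here is a made-up stand-in for `xy$Years`.

```r
# Hypothetical example vector standing in for xy$Years
years <- c("2005, 2006", "2011", NA)

# Split on literal commas and count the pieces of each element
n <- lengths(strsplit(years, ",", fixed = TRUE))

# A missing string splits into a single NA piece, so set those counts to 0
n[is.na(years)] <- 0L
n
```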

edited Jun 29 '18 at 11:24

answered Jun 29 '18 at 09:29

or `lengths(strsplit(xy$Years, ","))` – Jaap Jun 29 '18 at 09:32
Thanks for the answer. Is there any way to make it not count NA values? – WoeIs Jun 29 '18 at 09:54
@WoeIs this is why I wrapped the result into an `sapply`. Instead of `length` you can specify an anonymous function where you can process each row/element however you please. – Roman Luštrik Jun 29 '18 at 11:21
