Split a column and count the number of elements + sum elements

Question

I am working with some data where one of the columns looks like

21070808(136)|19995886(87)|21280165(66)

20226255(57)|21440646(54)

...

Just to be clear, this is a single column. Each number which is not in parenthesis represents a publication id (e.g., 21070808) and the number in parenthesis represents the number of citations that this publication received (e.g., publication 21070808 received 136 citations).

For each observation, I would like to count the number of publications as well as the total number of citations. For instance taking the 2 observations above, I would like to get 2 columns (column1=Number of publications and column2=Citations):

 Number of publications - Citations

             3          -     289
             2          -     111

I have tried to look for solutions in R/Stata but could not get anything to work. I think for the number of publications I could just count the number of "|" character and add +1. But for the number of total citations, I am a bit more confused...

Any help would be really appreciated. I am indifferent between R/Stata (and even Python) :)

score 1 · Answer 1 · answered Sep 14 '22 at 23:05

library(tidyverse)

df %>%
  rowid_to_column() %>%
  separate_rows(col1, sep = '[|]')%>%
  separate(col1, c('num', 'cit'),convert = TRUE, extra = 'drop')%>%
  group_by(rowid)%>%
  summarise(num = n(), cit = sum(cit))%>%
  select(-rowid)

# A tibble: 2 x 2
    num   cit
  <int> <int>
1     3   289
2     2   111

M.Viking · Accepted Answer · 2022-09-14T23:05:10.293

df<-data.frame(x=c("21070808(136)|19995886(87)|21280165(66)","20226255(57)|21440646(54)"))

df$count<-str_count(df$x, "\\|")+1

df$sum<-str_extract_all(df$x, "(?<=\\()[^()]*(?=\\))")

df$sum<-lapply(df$sum, function(y) sum(as.numeric(y)))

df

                                        x sum count
1 21070808(136)|19995886(87)|21280165(66) 289     3
2               20226255(57)|21440646(54) 111     2

Regex sources to match data in parenthesis - Pattern to match only characters within parentheses

score 0 · Answer 3 · answered Sep 15 '22 at 06:39

Here is a Stata solution.

clear 
input str42 problem 
"21070808(136)|19995886(87)|21280165(66)"
"20226255(57)|21440646(54)"
end 

gen count = strlen(problem) - strlen(subinstr(problem, "|", "", .)) + 1 

* ssc install moss 
moss problem, match("(\([0-9]+)") regex 
destring _match*, ignore("(") replace 
egen citations = rowtotal(_match*)

keep problem count citations 
list 

     +------------------------------------------------------------+
     |                                 problem   count   citati~s |
     |------------------------------------------------------------|
  1. | 21070808(136)|19995886(87)|21280165(66)       3        289 |
  2. |               20226255(57)|21440646(54)       2        111 |
     +------------------------------------------------------------+

score 0 · Answer 4 · answered Sep 15 '22 at 16:34

Using base R

lst1 <- type.convert(regmatches(df$x, gregexpr("(?<=\\()\\d+", df$x, 
  perl = TRUE)), as.is = TRUE)
data.frame(Numer_of_publications = lengths(lst1), Citations = sapply(lst1, sum))
  Numer_of_publications Citations
1                     3       289
2                     2       111

Split a column and count the number of elements + sum elements

4 Answers4