1

I am working with some data where one of the columns looks like

21070808(136)|19995886(87)|21280165(66)

20226255(57)|21440646(54)

...

Just to be clear, this is a single column. Each number which is not in parenthesis represents a publication id (e.g., 21070808) and the number in parenthesis represents the number of citations that this publication received (e.g., publication 21070808 received 136 citations).

For each observation, I would like to count the number of publications as well as the total number of citations. For instance taking the 2 observations above, I would like to get 2 columns (column1=Number of publications and column2=Citations):

 Number of publications - Citations

             3          -     289
             2          -     111 

I have tried to look for solutions in R/Stata but could not get anything to work. I think for the number of publications I could just count the number of "|" character and add +1. But for the number of total citations, I am a bit more confused...

Any help would be really appreciated. I am indifferent between R/Stata (and even Python) :)

Nick Cox
  • 35,529
  • 6
  • 31
  • 47
ClaraJ
  • 57
  • 4

4 Answers4

1
library(tidyverse)

df %>%
  rowid_to_column() %>%
  separate_rows(col1, sep = '[|]')%>%
  separate(col1, c('num', 'cit'),convert = TRUE, extra = 'drop')%>%
  group_by(rowid)%>%
  summarise(num = n(), cit = sum(cit))%>%
  select(-rowid)

# A tibble: 2 x 2
    num   cit
  <int> <int>
1     3   289
2     2   111
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0
df<-data.frame(x=c("21070808(136)|19995886(87)|21280165(66)","20226255(57)|21440646(54)"))

df$count<-str_count(df$x, "\\|")+1

df$sum<-str_extract_all(df$x, "(?<=\\()[^()]*(?=\\))")

df$sum<-lapply(df$sum, function(y) sum(as.numeric(y)))

df               
                                        x sum count
1 21070808(136)|19995886(87)|21280165(66) 289     3
2               20226255(57)|21440646(54) 111     2

Regex sources to match data in parenthesis - Pattern to match only characters within parentheses

M.Viking
  • 5,067
  • 4
  • 17
  • 33
0

Here is a Stata solution.

clear 
input str42 problem 
"21070808(136)|19995886(87)|21280165(66)"
"20226255(57)|21440646(54)"
end 

gen count = strlen(problem) - strlen(subinstr(problem, "|", "", .)) + 1 

* ssc install moss 
moss problem, match("(\([0-9]+)") regex 
destring _match*, ignore("(") replace 
egen citations = rowtotal(_match*)

keep problem count citations 
list 

     +------------------------------------------------------------+
     |                                 problem   count   citati~s |
     |------------------------------------------------------------|
  1. | 21070808(136)|19995886(87)|21280165(66)       3        289 |
  2. |               20226255(57)|21440646(54)       2        111 |
     +------------------------------------------------------------+
Nick Cox
  • 35,529
  • 6
  • 31
  • 47
0

Using base R

lst1 <- type.convert(regmatches(df$x, gregexpr("(?<=\\()\\d+", df$x, 
  perl = TRUE)), as.is = TRUE)
data.frame(Numer_of_publications = lengths(lst1), Citations = sapply(lst1, sum))
  Numer_of_publications Citations
1                     3       289
2                     2       111
akrun
  • 874,273
  • 37
  • 540
  • 662