
I want to see whether I can visualise who is publishing with whom in peer-reviewed journals for a given subject. To do this I typed the keyword "Barrett's" into PubMed and downloaded a large file, which gives me two columns, Title and Author:

structure(list(Title = structure(c(1L, 4L, 3L, 2L, 5L), .Label = c("A case of Barrett's adenocarcinoma with marked endoscopic morphological changes in Barrett's esophagus over a long follow-up period of 15\xe4\xf3\x8ayears.", 
"APE1-mediated DNA damage repair provides survival advantage for esophageal adenocarcinoma cells in response to acidic bile salts.", 
"Healthcare Cost of Over-Diagnosis of Low-Grade Dysplasia in Barrett's Esophagus.", 
"Radiofrequency ablation coupled with Roux-en-Y gastric bypass: a treatment option for morbidly obese patients with Barrett's esophagus.", 
"Risk factors for Barrett's esophagus."), class = "factor"), 
    Author = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("Arora Z, Garber A, Thota PN.", 
    "Hong J, Chen Z, Peng D, Zaika A, Revetta F, Washington MK, Belkhiri A, El-Rifai W.", 
    "Iwaya Y, Yamazaki T, Watanabe T, Seki A, Ochi Y, Hara E, Arakura N, Tanaka E, Hasebe O.", 
    "Lash RH, Deas TM Jr, Wians FH Jr.", "Parikh K, Khaitan L."
    ), class = "factor")), .Names = c("Title", "Author"), row.names = c(NA, 
5L), class = "data.frame")

I want to count how many times one author has published with another author. I thought the best way to do this would be to create a co-occurrence matrix (later I'll be using igraph).

I am having trouble understanding how to convert my data into such a matrix. I guess it would involve listing all the authors as both column names and row names, then iterating through each row of the Auth data frame and recording the co-occurrence of two names in the matrix. Is there a quick way to do this? I am lost as to how to approach it, so I tried this:

1. Extract all the names into a long list from the Author column
2. Then create colnames from the Author list
3. Then create rownames from the Author list
4. Then somehow iterate through Auth[2] and count the name co-occurrence

...but I get stuck at the first extraction which I tried with:

AuthSplit<-strsplit(Auth$Author, ",", fixed=T)
AuthSplit<-as.data.frame(AuthSplit)

but I get this error:

 Error in data.frame(c("Iwaya Y", " Yamazaki T", " Watanabe T", " Seki A",  : 
  arguments imply differing number of rows: 9, 2, 3, 8, 20, 5, 1, 11, 4, 23, 6, 15, 16, 7, 12, 10, 14, 21, 13, 18, 19, 17, 22

There must be an easier way?

Sebastian Zeki
  • See [this QA](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses); `crossprod(table(rep(seq_along(AuthSplit), lengths(AuthSplit)), unlist(AuthSplit)))` – alexis_laz Mar 08 '16 at 20:08
  • OK that gives me an odd error: could not find function "lengths" – Sebastian Zeki Mar 09 '16 at 00:44
  • `lengths` was introduced in recent R versions; it's essentially `sapply(AuthSplit, length)`. – alexis_laz Mar 09 '16 at 07:23
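
The one-liner in the comment above can be unpacked as follows. This is only a minimal sketch on made-up author strings: the names `authors`, `auth_split`, `paper_id`, `incidence` and `co_occ` are illustrative and not from the question. It uses `sapply(..., length)` in place of `lengths` and `gsub` in place of `trimws`, on the assumption that an R version older than 3.2.0 is in use.

```r
# Toy data: one comma-separated author string per paper (illustrative only)
authors <- c("Iwaya Y, Yamazaki T, Watanabe T",
             "Parikh K, Khaitan L",
             "Iwaya Y, Parikh K")

# Split on commas and strip surrounding whitespace from each name
auth_split <- lapply(strsplit(authors, ",", fixed = TRUE),
                     function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x))

# Paper-by-author incidence table: rows are papers, columns are authors
paper_id  <- rep(seq_along(auth_split), sapply(auth_split, length))
incidence <- table(paper_id, unlist(auth_split))

# Author-by-author co-occurrence counts; the diagonal holds papers per author
co_occ <- crossprod(incidence)
co_occ["Iwaya Y", "Parikh K"]
```

Off-diagonal cells count shared papers for a pair of authors, so the result is the co-occurrence matrix the question describes. With many authors it grows quadratically, which is the concern the answer below addresses.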

1 Answer

If you have a large number of authors, the adjacency matrix could be quite large. Instead, you can create a list of author pairs, which igraph can use to build a graph. The basic approach is to form a vector of individual authors for each paper, then create a data frame of author pairs for each paper, keeping only those pairs in which the first author sorts alphabetically before the second so that each pair is counted once. The per-paper data frames are then combined into one larger data frame, from which a data frame of unique author pairs and the number of papers for each pair is formed. This data frame is used to create a graph in which the paper count is stored as an edge attribute, and the count can be displayed on a plot of the graph.

I've added a couple of papers to your list to cover the cases where the same author appears on more than one paper and the same pair of authors share more than one paper. The code looks like this:

library(igraph)

# add papers with authors from previous papers
Auth <- rbind(Auth,
              data.frame(Title = c("Paper A", "Paper B"),
                         Author = c("Iwaya Y, Parikh K, Lash RH", "Wians FH Jr., Lash RH")))

# create a list of individual authors for each paper
# (as.character is needed because Author is a factor)
pub_auths <- sapply(Auth$Author, function(x) strsplit(as.character(x), split=","))
pub_auths <- lapply(pub_auths, trimws)   # trimws requires R >= 3.2.0

# for each paper, form a data frame of unique author pairs,
# keeping only pairs where the first name sorts before the second
auth_pairs <- lapply(pub_auths, function(x) {
  z <- expand.grid(x, x, stringsAsFactors = FALSE)
  z[z$Var1 < z$Var2, ]
})

# combine the list of data frames for each paper into one data frame
auth_pairs <- do.call(rbind, auth_pairs)

# count papers for each author pair
auth_count <- aggregate(paste(Var1, Var2) ~ Var1 + Var2, data = auth_pairs, length)
colnames(auth_count) <- c("Author1", "Author2", "Paper_count")

# create graph from author pairs; Paper_count becomes an edge attribute
g <- graph_from_data_frame(auth_count, directed = FALSE)

# plot graph with paper counts as edge labels
plot(g, edge.label = E(g)$Paper_count, edge.label.cex = 1.4, vertex.label.cex = 1.4)

In the plot, the paper counts are shown as labels of the edges. Notice that Wians and Lash share two papers, one of which is a paper added to the data.
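
One thing to watch in the real data: the last author in each PubMed string carries a terminating period (for example "Wians FH Jr."), so the same person can end up as two different vertices depending on where they appear in the author list. Below is a minimal sketch of a clean-up helper, assuming the trailing period is never a meaningful part of a name. `split_authors` is a hypothetical name, not part of the code above, and it uses `gsub` rather than `trimws` so that it also runs on R versions before 3.2.0.

```r
# Hypothetical helper: split one author string into cleaned individual names
split_authors <- function(x) {
  # as.character() so that a factor value is handled like a plain string
  parts <- strsplit(as.character(x), ",", fixed = TRUE)[[1]]
  # strip leading and trailing whitespace from each name
  parts <- gsub("^[[:space:]]+|[[:space:]]+$", "", parts)
  # drop a single trailing period (assumption: it is only punctuation)
  sub("\\.$", "", parts)
}

split_authors("Lash RH, Deas TM Jr, Wians FH Jr.")
```

If this normalisation suits your data, something like `lapply(as.character(Auth$Author), split_authors)` could stand in for the two `pub_auths` lines.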


WaltS
  • OK. I like the approach, but I get a couple of errors. First, I got `In strsplit(as.character(x), split = ",") : input string 1 is invalid in this locale` when I ran `pub_auths <- sapply(Auth$Author, function(x) strsplit(as.character(x), split=","))`. Changing the locale to C made this go away. Then I got `Error in match.fun(FUN) : object 'trimws' not found`, so I installed `library(memisc)`, which made this go away. Then I got `Error in xj[i] : invalid subscript type 'builtin'` when I ran the `auth_count` line. Stumped on that – Sebastian Zeki Mar 10 '16 at 11:23
  • The locale depends on both R's `locale` settings and the O/S. My locale is `English_United States.1252` on Windows 8.1. `C` may be close to that. Glad that's working for you. `trimws` as well as `lengths` (from your earlier comment) are new functions introduced in R version 3.2.0. If possible, you might consider updating your version. I'm using 3.2.3. A new version 3.2.4 was just released this morning. – WaltS Mar 10 '16 at 15:30
  • For `aggregate`, you might look at `auth_pairs` to verify that it seems OK. Alternative code would be `auth_list <- lapply(split(auth_pairs, list(auth_pairs$Var1, auth_pairs$Var2), drop=TRUE), function(x) data.frame(Auth1=x[1,1], Auth2=x[1,2], Paper_count=nrow(x))); auth_count <- do.call(rbind, auth_list)` but it's hard to see why this should be necessary. – WaltS Mar 10 '16 at 15:31
  • OK. I sorted it. Some aggregate function weirdness I fixed by changing the format of the function. I edited your code. Now it works nicely. Thanks! – Sebastian Zeki Mar 10 '16 at 16:34
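
The thread does not say which change fixed the `aggregate` call. One way to avoid a computed left-hand side in the formula, offered here only as a sketch, is to add an explicit column of ones and sum it per pair. The toy `auth_pairs` below is made up to mirror the shape used in the answer, and the column name `n` is an assumption introduced for illustration.

```r
# Toy pair table shaped like auth_pairs in the answer (values are illustrative)
auth_pairs <- data.frame(Var1 = c("Lash RH", "Lash RH", "Iwaya Y"),
                         Var2 = c("Wians FH Jr.", "Wians FH Jr.", "Parikh K"),
                         stringsAsFactors = FALSE)

# Add an explicit counter column and sum it within each author pair
auth_pairs$n <- 1L
auth_count <- aggregate(n ~ Var1 + Var2, data = auth_pairs, FUN = sum)
colnames(auth_count) <- c("Author1", "Author2", "Paper_count")
auth_count
```

The resulting data frame has the same three columns as in the answer, so it can be passed to `graph_from_data_frame` in the same way.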