4

I have a SQL table that maps, say, authors and books. I would like to group linked authors and books (books written by the same author, and authors who co-wrote a book) together and ascertain how big these groups get. For example, if J.K. Rowling co-wrote with Junot Diaz, and Junot Diaz co-wrote a book with Zadie Smith, then I would want all three authors in the same group.

Here's a toy data set (h/t Matthew Dowle) with some of the relationships I am talking about:

set.seed(1)
authors <- replicate(100,sample(1:3,1))
book_id <- rep(1:100,times=authors)
author_id <- c(lapply(authors,sample,x=1:100,replace=FALSE),recursive=TRUE)
aubk <- data.table(author_id = author_id,book_id = book_id)
aubk[order(book_id,author_id),]

Here one sees that authors 27 and 36 co-wrote book 2, so they should be in the same group. The same for authors 63 and 100 for 3; and D, F and L for 4. And so on.

I can't think of a good way to do this other than a for-loop, which (as you can guess) is slow. I tried a bit of data.table to avoid unnecessary copying. Is there a better way of doing it?

aubk$group <- integer(dim(aubk)[1])
library(data.table)
aubk <- data.table(aubk)
#system.time({
for (x in 1:dim(aubk)[1]) {
    if(identical(x,1)) {
        value <- 1L
    } else {
        sb <- aubk[1:(x-1),]
        index <- match(aubk[x,author_id],sb[,author_id])
        if (identical(index,NA_integer_)) {
            index <- match(aubk[x,book_id],sb[,book_id])
            if (identical(index,NA_integer_)) {
                value <- x
            } else {
                value <- aubk[index,group]
            }
        } else {
            value <- aubk[index,group]
        }
    }
    aubk[x,group:=value]
}
#})

EDIT: As mentioned by @Josh O'Brien and @thelatemail, my problem can also be worded as looking for the connected components of a graph from a two-column list where every edge is a row, and the two columns are the nodes connected.

Blue Magister
  • 13,044
  • 5
  • 38
  • 56
  • If this question is about SQL, could you include the database engine? If not, please remove the SQL tag. – Gordon Linoff Sep 27 '12 at 21:56
  • I think I will eventually need to implement this in SQL, but I can ask that question separately. Removing tag. – Blue Magister Sep 27 '12 at 21:58
  • . . the solution in SQL will be harder than you think and highly dependent on the database engine. – Gordon Linoff Sep 27 '12 at 21:59
  • 1
    The example data appears to have 6 unique books with 6 unique authors, a 6 row dataset. Can you not provide some data which contains some links like you describe, along with a non trivial example of what the result should be? e.g. perhaps randomly pick between 1 and 3 authors, for 100 books, from a list of 50 authors, with `set.seed(1)` first. A `data.table` or hash table solution does seems like the way to go (not SQL). – Matt Dowle Sep 27 '12 at 22:39
  • @MatthewDowle See impending edit. – Blue Magister Sep 27 '12 at 23:56
  • I still think the example data, as you end up with a group of 26 authors. The pool of available authors should be larger than 26. – mnel Sep 28 '12 at 02:47
  • 2
    I'd use `RBGL::connectedComp()` to solve this, as demonstrated (in other applications) [here](http://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together/12136539#12136539) and [here](http://stackoverflow.com/questions/12294185/how-to-create-new-polygons-by-simplifying-from-two-spatialpolygonsdataframe-obje/12327602#12327602) – Josh O'Brien Sep 28 '12 at 06:40
  • @JoshO'Brien Intriguing - I didn't realize that my problem is a graph problem. I'd need to be careful because my graph is bipartite, and author_id's might collide with book_id's, but I can fix that by mapping each ID number to `2x` and `2x+1`. – Blue Magister Sep 28 '12 at 14:23
  • @mnel I increased the pool of authors to 100. – Blue Magister Sep 28 '12 at 14:31

3 Answers3

3

Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I couldn't use igraph. The RBGL package isn't updated for R version 2.15.1, so that was out as well.

After writing a lot of dumb code that doesn't seem to work, I think the following gets me to the right answer.

aubk[,grp := author_id]
num.grp.old <- aubk[,length(unique(grp))]
iterations <- 0
repeat {
    aubk[,grp := min(grp),by=author_id]
    aubk[,grp := min(grp), by=book_id]
    num.grp.new <- aubk[,length(unique(grp))] 
    if(num.grp.new == num.grp.old) {break}
    num.grp.old <- num.grp.new
    iterations <- iterations + 1
}
Blue Magister
  • 13,044
  • 5
  • 38
  • 56
1

Here's a go re-hashing my answer to an old question of mine that Josh O'Brien linked in the comments ( identify groups of linked episodes which chain together ). This answer uses the igraph library.

# Dummy data that might be easier to interpret to show it worked
# Authors 1,2 and 3,4 should group. author 5 is a group to themselves
aubk <- data.frame(author_id=c(1,2,3,4,5),book_id=c(1,1,2,2,5))

# identify authors with a bit of leading text to prevent clashes 
# with the book ids
aubk$author_id2 <- paste0("au",aubk$author_id)

library(igraph)
#create a graph - this needs to be matrix input
au_graph <- graph.edgelist(as.matrix(aubk[c("author_id2","book_id")]))
# get the ids of the authors
result <- data.frame(author_id=names(au_graph[1]),stringsAsFactors=FALSE)
# get the corresponding group membership of the authors
result$group <- clusters(au_graph)$membership

# subset to only the authors data
result <- result[substr(result$author_id,1,2)=="au",]
# make the author_id variable numeric again
result$author_id <- as.numeric(substr(result$author_id,3,nchar(result$author_id)))

> result
  author_id group
1         1     1
3         2     1
4         3     2
6         4     2
7         5     3
Community
  • 1
  • 1
thelatemail
  • 91,185
  • 12
  • 128
  • 188
0

A couple of suggestions

aubk[,list(author_list = list(sort(author_id))), by = book_id]

will give a list of author groups

The followingwill create a unique identifier for each group of authors and then return a list with

  • the number of books
  • A list of the book ids
  • A unique identifier of the book_ids
  • number of authors

for each unique group of authors

aubk[, list(author_list = list(sort(author_id)), 
            group_id = paste0(sort(author_id), collapse=','), 
            n_authors = .N),by =  book_id][,
        list(n_books = .N, 
             n_authors = unique(n_authors), 
             book_list = list(book_id), 
             book_ids = paste0(book_id, collapse = ', ')) ,by = group_id]

If the author order matters, just remove the sort with the definitions of author_list and group_id

EDIT

noting that the above, while useful does not do the appropriate grouping

Perhaps the following will

# the unique groups of authors by book
unique_authors <- aubk[, list(sort(author_id)), by = book_id]
# some helper functions
# a filter function that allows arguments to be passed
.Filter <- function (f, x,...) 
{
  ind <- as.logical(sapply(x, f,...))
  x[!is.na(ind) & ind]
}

# any(x in y)?
`%%in%%` <- function(x,table){any(unlist(x) %in% table)}
# function to filter a list and return the unique elements from 
# flattened values
FilterList <- function(.list, table) {
  unique(unlist(.Filter(`%%in%%`, .list, table =table)))
}

# all the authors
all_authors <- unique(unlist(unique_authors))
# with names!
setattr(all_authors, 'names', all_authors)
# get for each author, the authors with whom they have
# collaborated in at least 1 book
lapply(all_authors, FilterList, .list = unique_authors)
mnel
  • 113,303
  • 27
  • 265
  • 254
  • It's a great start! But IIUC if A coauthored with B on one book, and B coauthored with C on another book, that's not returning the group A,B,C (requirement in 1st paragraph) is it? – Matt Dowle Sep 28 '12 at 00:49