
I have data on every interaction that could and did happen at a university club's weekly social hour.

A sample of my data is as follows:

structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L, 
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", 
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L, 
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L, 
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from", 
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA, 
-18L))

I am trying to calculate network statistics, such as centrality, for A, B, and C for each individual week, for the last two weeks, and for the year to date. The only way I have gotten this to work is by manually breaking up the file into the time units I want to analyze, but I hope there is a less laborious way.

When timestalked is 0, this should be treated as no edge.

The output would be a .csv with the following:

actor  cent_week1 cent_week2 cent_week3 cent_last2weeks cent_yeartodate
 A       
 B
 C 

where cent_week1 is the centrality for 1/1/2010; cent_last2weeks considers only 1/8/2010 and 1/15/2010; and cent_yeartodate considers all of the data at once. This is being applied to a MUCH larger dataset of millions of observations.

CJ12
  • Post what you have tried so far that didn't work, and copy and paste the output of `dput(my_data)` instead of the way you have it formatted now. – acylam Oct 27 '17 at 13:23
  • @useR I have spent days searching the web and looking at tutorials with no luck. I resorted to manually breaking the csv into hundreds of subfiles using C++. I then ran the needed analysis. So it is all done, but for closure I think this is an important issue to have resolved. I understand if no one in the community knows how to do it. – CJ12 Oct 27 '17 at 13:25
  • I don't think it's that this question is too difficult. I'm sure _someone_ knows how to solve it. It's how you formatted your data that makes it hard for people to work with (read this https://stackoverflow.com/a/5963610/5150629). If you want to get helpful answers, at least post data that people can work with by copying and pasting the output of `dput(my_data)`, as well as what you expect the final output to look like. – acylam Oct 27 '17 at 13:31
  • @useR Makes sense, updated – CJ12 Oct 27 '17 at 14:06
  • Is this the sort of thing you want: to get a graph for each time slot `b = by(d, d$week, FUN=graph_from_data_frame)`, and then run functions over them `sapply(b, function(x) eigen_centrality(x, weights = E(x)$timestalked)$vector)` (not sure if that's sensible) – user20650 Oct 29 '17 at 23:52
  • @user20650 This seems in line with what I am asking for, which boils down to a dataset that looks like the output in my question. If you could turn the comment into an answer doing that, this would suffice. The ability to graph by week and cumulatively would also be helpful. – CJ12 Oct 30 '17 at 14:46
  • Could you tell us if @user20650's answer is satisfactory? – nghauran Oct 30 '17 at 16:52
  • @ANG It is not, no, as it does not produce the desired output – CJ12 Oct 30 '17 at 17:36
  • @CJ12, could you please give more detail concerning `week1`, `week2`, `week3`, `last2weeks` and `yeartodate`? `week1 == 1/1/2010`? `week2 == 1/8/2010`? `last2weeks == `?... – nghauran Oct 30 '17 at 17:49
  • @ANG Added more detail – CJ12 Oct 30 '17 at 18:03
  • @user20650 Sorry, I do not follow your comment. Feel free to post an answer that produces the above output – CJ12 Oct 30 '17 at 23:28

4 Answers


Can't comment, so I'm writing an "answer". If you want to perform some mathematical operation on timestalked and get values grouped by from (I didn't find any variable called actor in your example), here's a data.table approach that can be helpful:

library(data.table)

dat <- as.data.table(dat) # or add 'data.table' to the class parameter
dat$week <- as.Date(dat$week, format = "%m/%d/%Y")
dat[, .(cent = mean(timestalked)), by = list(from, weeknum = week(week))]

This gives the below output:

   from weeknum cent
1:    A       1  0.5
2:    A       2  2.0
3:    A       3  1.5
4:    B       1  0.5
5:    B       2  1.0
6:    B       3  0.5
7:    C       1  1.5
8:    C       2  0.5
9:    C       3  0.0

Assign this to new_dat. You can then subset by week simply with new_dat[weeknum %in% 2:3], or whatever other variation you want, or aggregate over the whole year; a small sketch of this is below. Additionally, you can sort/order as desired.
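
For example, a minimal sketch of that subsetting, assuming the per-week summary above has been assigned to new_dat (the output column names are just illustrative):

last2 <- new_dat[weeknum %in% 2:3, .(cent_last2weeks = mean(cent)), by = from]
ytd   <- new_dat[, .(cent_yeartodate = mean(cent)), by = from]
merge(last2, ytd, by = "from")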

Hope this helps!

Gautam

How about:

library(dplyr)

# tmp holds the data.frame from the question (the dput() structure)
centralities <- tmp       %>% 
  group_by(week)          %>% 
  filter(timestalked > 0) %>% 
  do(
    week_graph=igraph::graph_from_edgelist(as.matrix(cbind(.$from, .$to)))
  )                       %>% 
  do(
    ecs = igraph::eigen_centrality(.$week_graph)$vector
  )                       %>% 
  summarise(ecs_A = ecs[[1]], ecs_B = ecs[[2]], ecs_C = ecs[[3]])

You can use summarise_all if you have a lot of actors. Putting it in long format is left as an exercise; one possible sketch follows.
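
If it helps, a hedged sketch of one way to get long format, assuming the result ends up with one ecs_* column per actor (tidyr and the ecs_ prefix handling are my additions, not part of the original answer):

library(tidyr)
centralities %>%
  gather(key = "actor", value = "centrality", starts_with("ecs_")) %>%
  mutate(actor = sub("^ecs_", "", actor))  # strip the ecs_ prefix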

  • @dah2 Loading in the dataset from the question, I receive the following error with your code: `Error in eval(lhs, parent, parent) : object 'tmp' not found` – CJ12 Oct 31 '17 at 15:20
  • Obviously you have to load the `structure` in your question into the object `tmp`. –  Nov 01 '17 at 16:08
  • Obviously, I expected a complete answer using the supplied data. That is fine; if you can create the output as outlined in the question, I am happy to accept it – CJ12 Nov 01 '17 at 18:12
  • lol...it is not jumping through hoops. Either you answer the question as posted or you don't. Attempting the latter is wasting your own time – CJ12 Nov 02 '17 at 01:05

You can do this by defining your windows in another table, then doing by-group operations on each of the windows:

Data Preparation:

# Load Data
DT <- structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L, 
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", 
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L, 
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L, 
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from", 
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA, 
-18L))

# Code
library(igraph)
library(data.table)

setDT(DT)

# setup events
DT <- DT[timestalked > 0]
DT[, week := as.Date(week, format = "%m/%d/%Y")]

# setup windows, edit as needed
date_ranges <- data.table(label = c("cent_week_1","cent_week_2","cent_last2weeks","cent_yeartodate"),
                          week_from = as.Date(c("2010-01-01","2010-01-08","2010-01-08","2010-01-01")),
                          week_to = as.Date(c("2010-01-01","2010-01-08","2010-01-15","2010-01-15"))
)

# find all events within windows
DT[, JA := 1]
date_ranges[, JA := 1]
graph_base <- merge(DT, date_ranges, by = "JA", allow.cartesian = TRUE)[week >= week_from & week <= week_to]

Here is the by-group code. The second line is a bit gross because it calls eigen_centrality twice; I'm open to ideas about how to avoid the double call (one possibility is sketched after the code):

graph_base <- graph_base[, .(graphs = list(graph_from_data_frame(.SD))), by = label, .SDcols = c("from", "to", "timestalked")] # create graphs
graph_base <- graph_base[, .(vertex = names(eigen_centrality(graphs[[1]])$vector), ec = eigen_centrality(graphs[[1]])$vector), by = label] # calculate centrality
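
As an alternative to the second line above, a sketch that computes the centrality vector once inside j and builds both columns from it (my own variation, not part of the original answer):

graph_base <- graph_base[, {
  ec <- eigen_centrality(graphs[[1]])$vector
  .(vertex = names(ec), ec = ec)
}, by = label] # calculate centrality, calling eigen_centrality() once per window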

dcast for final formatting:

dcast(graph_base, vertex ~ label, value.var = "ec")
   vertex cent_last2weeks cent_week_1 cent_week_2 cent_yeartodate
1:      A       1.0000000   0.7071068   0.8944272       0.9397362
2:      B       0.7052723   0.7071068   0.4472136       0.7134685
3:      C       0.9008487   1.0000000   1.0000000       1.0000000
Chris
  • this is great. (1) I have been thinking that the best output columns would be `vertex` `date` `cent_this_week` `cent_last_two_weeks` and `cent_yeartodate`. This would make the code more portable, and I would be grateful if you knew a way to transpose the output into such a form. (2) Is it possible to output the dcast into a `.csv` in the `wd`? I have been trying to do both myself for the past hour with little to no progress. Thanks – CJ12 Nov 03 '17 at 18:12
  • Also, the real dataset has thousands of dates, so not having to hand-code them is needed – CJ12 Nov 03 '17 at 18:28
  • @CJ12 (1) I am not sure how to incorporate date in your output, given that the column definitions are specific date regions. What does date represent in this case? (2) Just use `write.csv()`; it will work on the dcasted value (see the sketch after these comments). (3) You can likely generate this table programmatically from your data - what have you tried so far? – Chris Nov 03 '17 at 19:45
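
For points (2) and (3) above, a minimal sketch of what those suggestions might look like; the file name and the generated window labels are illustrative assumptions, not part of the original answer:

# write the dcast result to a CSV in the working directory
wide <- dcast(graph_base, vertex ~ label, value.var = "ec")
write.csv(wide, "centralities.csv", row.names = FALSE)

# build the per-week windows programmatically instead of hand-coding them
# (assumes DT$week has already been converted to Date as above)
wks <- sort(unique(DT$week))
date_ranges <- data.table(label = paste0("cent_week_", seq_along(wks)),
                          week_from = wks,
                          week_to = wks)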

This analysis follows the general split-apply-combine approach, where the data are split by week, graph functions are applied, and then the results are combined. There are several tools for this, but the code below uses base R and data.table.

Base R

First set the date class for your data, so that the term "last two weeks" has meaning.

library(igraph)

# Set date class and order
d$week <- as.Date(d$week, format="%m/%d/%Y")
d <- d[order(d$week), ]
d <- d[d$timestalked > 0, ] # remove zero-weight edges // not needed if using weights

Then split the data and apply the graph functions:

# split the data and form a graph for each week
g1 <- lapply(split(seq(nrow(d)), d$week), function(i) 
                                                  graph_from_data_frame(d[i,]))
# you can then run graph functions to extract specific measures
(grps <- sapply(g1, function(x) eigen_centrality(x,
                                            weights = E(x)$timestalked)$vector))

#   2010-01-01 2010-01-08 2010-01-15
# A  0.5547002  0.9284767  1.0000000
# B  0.8320503  0.3713907  0.7071068
# C  1.0000000  1.0000000  0.7071068

# Aside: If you only have one function to run on the graphs, 
# you could do this in one step
# 
# sapply(split(seq(nrow(d)), d$week), function(i) {
#             x = graph_from_data_frame(d[i,])
#             eigen_centrality(x, weights = E(x)$timestalked)$vector
#           })

You then need to combine this with the analysis on all the data and on the last two weeks - as you only have to build two further graphs, this is not the time-consuming part.

fun1 <- function(i, name) {
            x = graph_from_data_frame(i)
            d = data.frame(eigen_centrality(x, weights = E(x)$timestalked)$vector)
            setNames(d, name)
    }


a = fun1(d, "alldata")
lt = fun1(d[d$week %in% tail(unique(d$week), 2), ], "lasttwo")

# Combine: could use `cbind` in this example, but perhaps `merge` is 
# safer if there are different levels between dates
data.frame(grps, lt, a) # or
Reduce(merge, lapply(list(grps, a, lt), function(x) data.frame(x, nms = row.names(x))))

#   nms X2010.01.01 X2010.01.08 X2010.01.15  alldata lasttwo
# 1   A   0.5547002   0.9284767   1.0000000 0.909899     1.0
# 2   B   0.8320503   0.3713907   0.7071068 0.607475     0.5
# 3   C   1.0000000   1.0000000   0.7071068 1.000000     1.0

data.table

It is likely that the time-consuming step will be explicitly splitting the data and applying the function over each piece. data.table should offer some benefit here, especially when the data become large and/or there are more groups.

# function to apply to graph
fun <- function(d) {
  x = graph_from_data_frame(d)
  e = eigen_centrality(x, weights = E(x)$timestalked)$vector
  list(e, names(e))
}

library(data.table)
dcast(
  setDT(d)[, fun(.SD), by=week], # apply function - returns data in  long format
  V2 ~ week, value.var = "V1")   # convert to wide format

#    V2 2010-01-01 2010-01-08 2010-01-15
# 1:  A  0.5547002  0.9284767  1.0000000
# 2:  B  0.8320503  0.3713907  0.7071068
# 3:  C  1.0000000  1.0000000  0.7071068

Then just run the function over the full data / last two weeks as before; a sketch of that is below.
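
For example, a minimal sketch that applies the same fun to the full data and to the last two weeks, and joins the results onto the weekly table (the column names alldata and lasttwo are just illustrative):

# weekly results from above
weekly <- dcast(setDT(d)[, fun(.SD), by = week], V2 ~ week, value.var = "V1")

# same function on all of the data and on the last two weeks
full  <- fun(d)
last2 <- fun(d[d$week %in% tail(sort(unique(d$week)), 2), ])

weekly[, alldata := full[[1]][match(V2, full[[2]])]]
weekly[, lasttwo := last2[[1]][match(V2, last2[[2]])]]
weekly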

There are differences between the answers, which come down to the weights argument: this answer uses the weights when calculating the centralities, whereas the others don't.


Data used in this answer:

d = structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L, 
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", 
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L, 
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L, 
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from", 
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA, 
-18L))
user20650