I have the following data frames in R:

Id   Class
@a    64
@b    7
@c    98 

And the second data frame:

SOURCE    TARGET 
@d        @b
@c        @a 

This describes the nodes and the edges in a social network. The users (all with @ in front) each belong to a specific community, whose number is listed in the column Class. To analyse the connections between the columns I want to merge these data frames and create a new data frame looking like this:

SOURCE    TARGET    SOURCE.Class    TARGET.Class 
@a        @i        56               2
@f        @k        90               49 

When I try merge(), R stops responding and I need to terminate it. The data frames contain 20000 (node file) and 30000 (edge file) rows.

Then I want to know how many records in a given source class have the same target class and percentage of connections between classes.

I will be so happy if someone could help me since I'm very new to R.

EDIT: I think I managed to create the columns with this code, using match() instead of merge() (rt_nodes contains the columns "id", "class" and rt_edges contains the columns "source", "target"):

#match source in rt_edges with id in rt_node
match(rt_edges$Source,rt_nodes$id)

#match target in rt_edges with id in rt_node
match(rt_edges$Target,rt_nodes$id)

#create source_class 
rt_nodes$modularity_class[match(rt_edges$Source,rt_nodes$id)]
rt_edges$Source_Class=rt_nodes$modularity_class[match(rt_edges$Source,rt_nodes$id)]

#create target_class
rt_nodes$modularity_class[match(rt_edges$Target,rt_nodes$id)]
rt_edges$Target_Class=rt_nodes$modularity_class[match(rt_edges$Target,rt_nodes$id)]

Now I just need to figure out how I can find the percentage of connections in each class and the percentage of connections with other classes. Any tips on how to do that?

lmo
staanR
  • Please check the structure of the datasets, i.e. `str(dat1)` – akrun Apr 07 '17 at 10:18
  • Both datasets are data.frame with two variables and different number of rows. – staanR Apr 07 '17 at 10:32
  • Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – nrussell Apr 07 '17 at 11:01
  • `When I try merge() R stop responding...` could you add the code as well? – zx8754 Apr 07 '17 at 11:05
  • These are two questions. (1) Why isn't `merge` working? Here, the code causing problems should be shown. (2) _How many records in a given source class have the same target class and percentage of connections between classes._ This should be posted as a separate question giving the expected result and matching sample data. Thank you. – Uwe Apr 09 '17 at 09:52

1 Answer

Question 1: Merge

This requires two separate join operations: An initial join of rt_edges with rt_nodes on Target and a subsequent join of the intermediate result with rt_nodes on Source. In addition, all rows of rt_edges should appear in the result.

The approach below uses data.table. (I've adopted the variable and column names the OP used in the edited code of the question, but note that they are inconsistent with the sample data given by the OP.)

Reading data

library(data.table)
rt_nodes <- fread(
  "id   Class
  @a    64
  @b    7
  @c    98
  @d    23
  @f    59")
rt_edges <- fread(
  "Source    Target 
  @d        @b
  @c        @a
  @a        @e")

Note that additional rows have been added to the sample data provided by the OP to demonstrate the effect of

  • a node (@f) not involved in an edge and
  • an edge (@a -> @e) where one id is missing from rt_nodes.
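
Both situations can also be checked directly before joining. The following is only a minimal sketch on the sample data; the names `dangling_edges` and `isolated_nodes` are made up for illustration. The duplicate check is included on the assumption that repeated ids are one possible reason for a join growing unexpectedly, which the question does not confirm.

```r
library(data.table)

# Sample data mirroring the tables read above
rt_nodes <- data.table(id = c("@a", "@b", "@c", "@d", "@f"),
                       Class = c(64L, 7L, 98L, 23L, 59L))
rt_edges <- data.table(Source = c("@d", "@c", "@a"),
                       Target = c("@b", "@a", "@e"))

# Edges where at least one endpoint has no entry in rt_nodes
dangling_edges <- rt_edges[!(Source %in% rt_nodes$id) | !(Target %in% rt_nodes$id)]

# Nodes that do not take part in any edge
isolated_nodes <- rt_nodes[!(id %in% c(rt_edges$Source, rt_edges$Target))]

# Index of the first duplicated id, or 0 if every id is unique
# (assumption: duplicated ids would multiply rows in a join)
first_duplicate <- anyDuplicated(rt_nodes$id)
```

On the sample data, `dangling_edges` holds the edge @a -> @e and `isolated_nodes` holds the node @f.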

Twofold join

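By default, a data.table join of the form X[Y, on = ...] is a right join: all rows of Y are kept and matching values from X are looked up. Therefore, rt_edges is placed on the right side, i.e., inside the brackets.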

result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "Target")], on = c(id = "Source")]

# rename columns
setnames(result, c("Source", "Source.Class", "Target", "Target.Class"))

result
#   Source Source.Class Target Target.Class
#1:     @d           23     @b            7
#2:     @c           98     @a           64
#3:     @a           64     @e           NA

All three edges appear in the result. The NA indicates that @e is missing from rt_nodes.
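
As an alternative sketch (not part of the approach above), the two class columns could also be added with data.table update joins, which look up values from rt_nodes and assign them with `:=`. The name `edges_with_class` is hypothetical, and the sample data are repeated so the snippet runs on its own.

```r
library(data.table)

# Sample data mirroring the tables read above
rt_nodes <- data.table(id = c("@a", "@b", "@c", "@d", "@f"),
                       Class = c(64L, 7L, 98L, 23L, 59L))
rt_edges <- data.table(Source = c("@d", "@c", "@a"),
                       Target = c("@b", "@a", "@e"))

# Work on a copy so the original edge table is left untouched
edges_with_class <- copy(rt_edges)

# Look up the class of each Source id and add it as a new column
edges_with_class[rt_nodes, on = c(Source = "id"), Source.Class := i.Class]

# Same lookup for each Target id; ids missing from rt_nodes stay NA
edges_with_class[rt_nodes, on = c(Target = "id"), Target.Class := i.Class]
```

This keeps every row of the edge table and gives the column order Source, Target, Source.Class, Target.Class without a renaming step.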

Question 2

The OP has included a second question (and has also created a new post in the meantime):

Then I want to know how many records in a given source class have the same target class and percentage of connections between classes.

result[, .(.N, share_of_occurrence_in_Target.Class = sum(Source.Class == Target.Class)/.N), 
       by = Source.Class]
#   Source.Class N share_of_occurrence_in_Target.Class
#1:           23 1                                   0
#2:           98 1                                   0
#3:           64 1                                  NA

All counts are 1 and the shares are 0 (or NA where the target class is unknown) because the sample data contain no edges with matching classes. However, the code has been verified to work with the data provided in the OP's other post.
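
For the "percentage of connections between classes" part, one possible reading is the share of each (Source.Class, Target.Class) pair among all edges, or within its source class. The sketch below rests on that assumption and uses a small, purely hypothetical `result` table with made-up class numbers so that some pairs actually repeat; `pair_counts`, `pct_of_all` and `pct_within_source` are illustrative names.

```r
library(data.table)

# Hypothetical joined edge list (made-up classes, for illustration only)
result <- data.table(
  Source.Class = c(1L, 1L, 1L, 2L, 2L),
  Target.Class = c(1L, 1L, 2L, 1L, 2L)
)

# Count the edges for every (Source.Class, Target.Class) pair
pair_counts <- result[, .N, by = .(Source.Class, Target.Class)]

# Share of each pair among all edges, in percent
pair_counts[, pct_of_all := 100 * N / sum(N)]

# Share of each pair within its source class, in percent
pair_counts[, pct_within_source := 100 * N / sum(N), by = Source.Class]
```

Rows where Source.Class equals Target.Class then give the within-class connections, and all other rows give the connections between different classes.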

Uwe