3

Suppose I have the following directed acyclic graph (DAG) with each node having a weight of 1.

simple DAG

I am interested in calculating the accumulated sum of each node based on the value of its ancestor. Assuming as I said earlier that the weight of each node is 1, then this is what I would expect to get

cumulative sum per node

This is what I tried to do:

 library(tidygraph, quietly = TRUE) 
 library(tidyverse)
 library(ggraph)

 # create adjacencies
 grafo_df <- tribble(
  ~from, ~to,
  "C", "A",
  "C", "B",
  "A", "D",
  "B", "D")
 
 # create the graph
 grafo <- as_tbl_graph(grafo_df)
 
 
 # calculate accumulated sum
 grafo %>% 
  arrange(node_topo_order()) %>% 
  mutate(
   
   revenue = 1,
   
   cum_weight = map_dfs(1, .f = function(node, path, ...) {
    
    sum(.N()$revenue[c(node, path$node)])
    
   })) %>% 
  as_tibble() %>% 
  unnest("cum_weight")
 
#> # A tibble: 4 x 3
#>   name  revenue cum_weight
#>   <chr>   <dbl>      <dbl>
#> 1 C           1          1
#> 2 A           1          2
#> 3 B           1          2
#> 4 D           1          3

Created on 2021-05-13 by the reprex package (v2.0.0)

As you can see, the accumulated sum of D results in 3 and not 4, because the value of D should be the sum of the accumulated value of A and B. I do not understand why D does not add 4

I have tried to understand the solution given here, but had a hard time understanding it

How can I get the accumulated sum?

Update # 1

I am not concerned (for the moment) with the complexity of the algorithm, that is, if the algorithm does it in O(V + E) it is not relevant.

Something important that is mentioned in this question is about the problem of counting twice, that is, the partial sum of the value of A is equal to C(1) + A(1) = 2, and the partial sum of the value of B is equal to C(1) + B (1) = 2, so to say that the value of D is not equal to the partial sums of A (2) + B(2) because the value of C would be duplicating I think it does not apply in this situation due to the following:

Let's imagine that each of these 4 nodes (A, B, C and D) are internet nodes that generate revenue of $1 each, so the total accumulated income of the 4 nodes would be $4. If D is the convergence node of the rest of nodes, then in a scenario where D stops working, the income of the remaining nodes and that of D would no longer be possible, therefore, its value is $4.

Update # 2

If I add a new path from C to D then the value of D should always be 4 because the number of dependent nodes is maintained, that is, what should matter is the number of dependent nodes in the accumulated sum. For example, in the solution proposed by @ThomasIsCoding, if I add this new path, the value of D is now 5, I think partly that their algorithm uses the degrees as a parameter to calculate the cumulative sum, however, if I add a additional node then the calculation is correct.

Update # 3

The example that I have placed is simple with the intention that it is easy to understand the objective, however, I did not specify that it should be generalizable for a graph with many nodes with three different topologies. The outermost layers are trees, the middle layers are rings, and the innermost layer is a full mesh.

William
  • 164
  • 11
  • How do you get the accumulated sum for Node A to be 2? Should it not be 1? – Onyambu May 14 '21 at 04:06
  • 1
    As I understand D should be 5, not 4, 4 is only the sum of ancestor without the node value. – Rocco May 14 '21 at 09:54
  • @Onyambu. Because it is the accumulated sum, when applying the sum function through map_dfs, the value of A is added plus the value of its ancestor. All are worth 1 at the start. When we start to iterate to calculate the accumulated, the value of A will be the value that A currently has plus the value of its ancestor, which in this case is C, therefore 1 + 1 = 2. – William May 14 '21 at 22:27
  • @Rocco. The scenario or context in which it should be worth 4 is as follows: Suppose these 4 nodes are revenue-generating sites (in this case, suppose they generate $ 1). As you can see, nodes C, B and A depend on D being working, since all the "traffic" goes through D. What would happen if D crashed or stopped working? Well, we would have a loss of income of $ 4, which would be what generates C, A, B and D. Under this assumption, the value or accumulated sum of D should be 4. – William May 14 '21 at 22:31
  • Your answer to @Rocco's question doesn't make sense. You compute C's sum by adding C's own value (1) to the sum of its ancestors (also 1), so why do you not do the same for D? If you applied this rule consistently (i.e., if you did the same thing for D as you do for C), it would indeed result in D's sum being 2+2+1=5. – j_random_hacker May 19 '21 at 00:57
  • Now that I see what your application is, I think the "cumulative sum of ancestors" isn't what you want to be computing, because it will count some nodes twice or more. In your example, where for reasons I don't understand you don't count A at all, things happen to cancel out because you count C twice, so you get the answer you were hoping to see even though you calculate it the wrong way, but that won't happen in general, as your own example of adding a CD edge shows. – j_random_hacker May 19 '21 at 01:04
  • @j_random_hacker "C" has no ancestor. I tried to explain about the issue of partial sums in update # 1. I understand perfectly that it is easy to consider that I am counting "C" twice, but as I put it in the example the use case to do it this way is valid for the application I need, although since I am a newbie to the graph part, I would agree with you that maybe I misdescribed my problem and shouldn't call it cumulative sum of ancestor. How do you suggest I call this problem? – William May 19 '21 at 03:22
  • I read Update #1, but it doesn't clarify for me what you're actually trying to achieve -- it only talks about a hypothetical situation ("to say that ...") which you then go on to say "does not apply". Well, what *does* apply? Do you want to count ancestors multiple times or not? Reading your "internet revenue" explanation, I *think* the answer is "no", but I'm not sure because there are interpretations for which counting C twice would be the right thing to do. – j_random_hacker May 19 '21 at 04:23
  • Maybe your answer to the following question will clear things up: Is partialSum(D) meant to be 4 because it has 3 ancestors in total, each with weight 1, plus it has weight 1 itself? – j_random_hacker May 19 '21 at 04:25
  • @j_random_hacker Yes. – William May 19 '21 at 17:05

2 Answers2

1

Here is an igraph option using distance with argument mode = "in"

  • If your nodes are unweighted, i.e., revenue=1 for all nodes
g <- graph_from_data_frame(grafo_df)

data.frame(name = names(V(g))) %>%
  mutate(revenue = 1) %>%
  mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))

which gives you

  name revenue cum_weight
1    C       1          1
2    A       1          3
3    B       1          2
4    F       1          1
5    D       1          5
  • If your nodes are weighted, e.g.,
data.frame(name = names(V(g))) %>%
  mutate(revenue = 1:n()) %>%
  mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))

which gives you

  name revenue cum_weight
1    C       1          1
2    A       2          7
3    B       3          4
4    F       4          4
5    D       5         15

Data

grafo_df <- tribble(
  ~from, ~to,
  "C", "A",
  "C", "B",
  "A", "D",
  "C", "D",
  "B", "D",
  "F", "A"
)

and the DAG by plot(g) is given as

enter image description here

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • adding an additional node works great! However, if I add a new path, say from C to D, then it updates the value of D to 5, which would not be correct. Similarly, the focus on your code might help me think of a way to fix it, however, I am not familiar with `igraph` – William May 17 '21 at 15:32
  • @William Thanks for your feedback. I guess I understood your problem incorrectly and made it too complicated. I updated my solution and hope it works for your now. – ThomasIsCoding May 17 '21 at 21:46
  • It works, but as I put in my update # 3 it seems that I have failed to explain that the solution must be generalizable so that it calculates the cumulative sum for a graph with different topologies. For example, if I add F -> A in the tibble, the final value of D should now give 6, however it gives 5. – William May 19 '21 at 00:32
  • @William I didn't see why D is 6. D has 4 ancestors. Could you explain a bit. I updated my solution based on your update3. – ThomasIsCoding May 19 '21 at 01:05
  • My mistake. When I was testing I had first added an additional node "E" and later I tested F -> A. I will test your solution by adding different topologies, including isolated nodes and meshes. I will let you know about the results. – William May 19 '21 at 03:13
  • @William I updated my answer with a much simpler one, you can check it out. – ThomasIsCoding May 19 '21 at 20:14
  • I have reviewed your previous solution in a few scenarios and it is powerful. Not yet verified your new approach. – William May 21 '21 at 22:57
  • 1
    @William I think the new approach should work as well, but it is more efficient. – ThomasIsCoding May 22 '21 at 20:17
  • Thanks i will try it – William May 25 '21 at 16:33
0

Now the question is clear, so I propose an algorithm, I cannot code it since I don't know the language that you are using.

For each node Ni in the graph we will calculate the set of ancestors Ai, then the accumulated sum for each node will be |Ai| + 1.

  1. Initialize all nodes with an empty ancestor set Ai = {}
  2. Start with a set S0 containing all nodes with no incoming edges
  3. Initialize the next set Sn+1
  4. Iterate over Sn, for each node N:
  5. For all nodes D with an incoming edge from N:
    1. Merge the ancestor set of D with the ancestor set of N plus N itself
    2. remove the egde N->D
  6. If D has no other incoming edges add it to Sn+1
  7. If Sn+1 is not empty, increase pass to n+1 and repeat from 2.

The big limit of this solution is the complexity, I'll try later to find some optimized solution.

Rocco
  • 1,098
  • 5
  • 11