I'm trying to compute and plot the out- and in-degree distributions for the Wikipedia vote network (part of the SNAP collection of network datasets). This is a directed graph, represented as an edge list.

To read and store the graph data:

% Read the data file (tab-delimited, 4 header lines).
G = importdata('Wiki-Vote.txt', '\t', 4);

% G is a structure that contains:
% - data: a <num_of_edges,2> matrix of node (wiki user) ids
% - textdata: a cell array with the header strings (first 4 lines)
% - colheaders: a cell array with the last descriptive string (fourth line)
% All the useful information is in the data matrix.

% Split the directed edge list into 'from' and 'to' node lists.
Nfrom = G.data(:,1); % Will be used to compute the out-degree
Nto = G.data(:,2);   % Will be used to compute the in-degree

Motivated by this question, I computed the out-degree as follows:

% Remove duplicate entries from the Nfrom and Nto lists.
Nfrom = unique(Nfrom); % Will be used to compute the out-degree distribution.
Nto = unique(Nto);     % Will be used to compute the in-degree distribution.

% Out-degree: count the number of occurrences in G.data(:,1) of each
% element (node/user id) of Nfrom.
outdegNsG = histc(G.data(:,1), Nfrom);
odG = hist(outdegNsG, 1:numel(Nfrom));

figure;
plot(odG)
title('linear-linear scale plot: outdegree distribution');
figure;
loglog(odG)
title('log-log scale plot: outdegree distribution');

The same steps apply to the in-degree. But the linear plot I get is far from satisfying, which made me wonder whether my approach is correct.

In linear scale:

[Figure: out-degree distribution, linear scale]

In log-log scale:

[Figure: out-degree distribution, log-log scale]

Zooming into the linear-scale plot suggests that the distribution is close to a power law:

[Figure: zoomed view of the linear-scale plot]

My question is whether my approach to computing the degree distribution is correct, as I have no other way to verify it. Specifically, I want to know whether using fewer bins in histc would give a clearer graph without losing any valuable information.

Kapoios
  • With histograms you always have a trade-off: fewer bins give less resolution on the x-axis, but also reduce noise on the y-axis – Luis Mendo Nov 04 '13 at 11:04
  • Perhaps use wider bins as you move to the right? For example, choose bin edges with `logspace` instead of linearly. That way you will reduce noise in the right-hand part of the graph – Luis Mendo Nov 04 '13 at 11:17
  • I'm in the process of implementing logarithmic (exponential?) binning. I asked this question to make sure that my current methodology is correct and that the noise comes from the chosen dataset. – Kapoios Nov 04 '13 at 11:21
  • Well, you should always expect noise in a histogram, and more so the fewer elements you have per bin – Luis Mendo Nov 04 '13 at 11:26
  • You could also use a library such as [NetworkX](http://networkx.github.io/) in Python or [this one from MIT](http://strategic.mit.edu/downloads.php?page=matlab_networks) for MATLAB. – andy mcevoy Nov 04 '13 at 14:08
  • I'm not permitted to use any code that is published on the web. :( – Kapoios Nov 04 '13 at 14:37
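
The logarithmic binning suggested in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `deg` vector is made-up toy data standing in for `outdegNsG`, and `nBins = 5` is an arbitrary choice, not a recommended value.

```matlab
% Hypothetical per-node degrees; a stand-in for outdegNsG.
deg = [1 1 1 1 2 2 3 5 8 20 100];

% Logarithmically spaced bin edges from 1 up to just past the maximum degree.
nBins = 5;                                   % assumed value; tune as needed
edges = logspace(0, log10(max(deg) + 1), nBins + 1);

% Count the degrees falling in each bin. histc returns one extra trailing
% count for values equal to the last edge; no degree reaches that edge
% here, so the trailing entry is dropped.
counts = histc(deg, edges);
counts = counts(1:end-1);

% Divide by bin width so that wide bins are not over-weighted, and use the
% geometric mean of each bin's edges as its x position.
widths  = diff(edges);
density = counts ./ widths;
centers = sqrt(edges(1:end-1) .* edges(2:end));

loglog(centers, density, 'o-');
```

Wider bins on the right collect more nodes each, which tends to smooth the noisy tail at the cost of resolution there. Empty bins have zero density and simply do not appear on a log-log plot.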

1 Answer

Okay... My previous approach would be correct if I wanted to plot the out- (or in-) degree of each node, not the degree distribution...

For out-degree distribution:

Nfrom = G.data(:,1);                         % source node of every edge
Nfrom = unique(Nfrom);                       % distinct source nodes
outdegNsG = histc(G.data(:,1), Nfrom);       % out-degree of each source node
outdd = histc(outdegNsG, unique(outdegNsG)); % number of nodes per distinct out-degree

So I should plot:

loglog(1:length(outdd),outdd);

The same applies to the in-degree.
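
One detail to watch when plotting: `1:length(outdd)` is the position of each distinct degree in the sorted list, not the degree itself, so the x-axis only matches the true degree when every degree from 1 upward actually occurs. A minimal sketch that plots against the degree values instead is given below. The `edges` matrix is a hypothetical toy edge list standing in for `G.data`, and `accumarray` is used here as one possible alternative to `histc`.

```matlab
% Toy directed edge list [from, to]; a hypothetical stand-in for G.data.
edges = [1 2; 1 3; 1 4; 2 3; 4 3; 5 1];

% Out-degree of every node that has at least one outgoing edge.
[nodes, ~, idx] = unique(edges(:,1));
outdeg = accumarray(idx(:), 1);

% Distinct degree values and the number of nodes having each of them.
[degVals, ~, didx] = unique(outdeg);
outdd = accumarray(didx(:), 1);

% Plot the node counts against the actual degree values.
loglog(degVals, outdd, 'o');
```

In this toy graph, three nodes have out-degree 1 and one node has out-degree 3. For the in-degree, the same steps would use `edges(:,2)` in place of `edges(:,1)`.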

Kapoios
  • May I ask something? Why not use `[outdd, centers] = histc(outdegNsG, unique(outdegNsG));` and then `loglog(centers, outdd);`? – Thoth Jul 28 '14 at 21:01