Read Pajek .net files using Graph-tool

Question

I have a Pajek network file (undirected network with weighted edges), for which an example is provided here:

*Vertices 5
1  apple
2  cat
3  tree
4  nature
5  fire
*Edges
1  3  14
2  4  1

Node labels are provided without quoting. Edges are specified as node1, node2, edge weight.

I need to read this file into graph-tool as an undirected graph with node labels and a "weight" attribute on the edges. The function should also preserve isolated nodes.

Is there an efficient way to do this in Python? So far I have been reading the .net file with Networkx and then using a conversion function like this. I am looking for a way to speed up the process.

Stuart Berg · Accepted Answer · 2020-10-30T16:54:05.070

It appears that each section (Vertices/Edges) of the Pajek file can be interpreted as a space-delimited CSV file, which means you can parse it with pandas.read_csv(). That function is faster than the line-by-line parsing you suggested in your pure-python answer.
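
As a minimal sketch of that idea (separate from the full function further down), the snippet below parses a small edges section with pandas.read_csv(). The sample text and the column names v1, v2 and weight are assumptions chosen for illustration, and the text is assumed to already use a single space as the delimiter.

```python
from io import StringIO

import pandas as pd

# Hypothetical edges section, already normalized to single-space delimiters.
edges_text = "1 3 14\n2 4 1\n"

# Treat the section as a space-delimited CSV file without a header row.
e_df = pd.read_csv(
    StringIO(edges_text),
    sep=" ",
    header=None,
    names=["v1", "v2", "weight"],
)

# Pajek vertex ids start at 1; graph-tool vertex indices start at 0.
e_df[["v1", "v2"]] = e_df[["v1", "v2"]] - 1

print(e_df)
```

The resulting columns can then be handed to graph-tool in bulk instead of being added row by row.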

Also, it's faster to initialize the edge list and property lists all at once (as numpy arrays) rather than setting each element individually in a python loop.

I think the following implementation ought to be somewhat close to optimal, but I haven't benchmarked it.

import re
from io import StringIO

import numpy as np
import pandas as pd

import graph_tool as gt
import graph_tool.stats  # ensures gt.stats is loaded (used by remove_self_loops below)

def pajek_to_gt(path, directed=False, remove_loops=False):
    """
    Load a Pajek .NET file[1] as a graph_tool.Graph.
    Supports files which specify their edges via node pairs.
    Does not support files which specify their edges via the
    'edgeslist' scheme (i.e. the neighbors-list scheme).

    Note:
        Vertices are renumbered to start with 0, per graph-tool
        conventions (not Pajek conventions, which start with 1).

    Author: Stuart Berg (github.com/stuarteberg)
    License: MIT

    [1]: https://gephi.org/users/supported-graph-formats/pajek-net-format/
    """
    # Load into RAM
    with open(path, 'r') as f:
        full_text = f.read()

    if '*edgeslist' in full_text:
        raise RuntimeError("Neighbor list format not supported.")

    # Erase comment lines
    full_text = re.sub(r'^\s*%.*$', '', full_text, flags=re.MULTILINE)

    # Erase blank lines (including those created by erasing comments)
    full_text = re.sub(r'\n+', '\n', full_text)

    # Ensure delimiter is a single space
    full_text = re.sub(r'[ \t]+', ' ', full_text)

    num_vertices = int(StringIO(full_text).readline().split()[-1])

    # Split into vertex section and edges section
    # (Vertex section might be empty)
    vertex_text, edges_text = re.split(r'\*[^\n]+\n', full_text)[1:]

    # Parse vertices (if present)
    v_df = None
    if vertex_text:
        v_df = pd.read_csv(StringIO(vertex_text), delimiter=' ', engine='c', names=['id', 'label'], header=None)
        assert (v_df['id'] == np.arange(1, 1+num_vertices)).all(), \
            "File does not list all vertices, or lists them out of order."

    # Parse edges
    e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', engine='c', header=None)
    if len(e_df.columns) == 2:
        e_df.columns = ['v1', 'v2']
    elif len(e_df.columns) == 3:
        e_df.columns = ['v1', 'v2', 'weight']
    else:
        raise RuntimeError("Can't understand edge list")

    e_df[['v1', 'v2']] -= 1

    # Free up some RAM
    del full_text, vertex_text, edges_text

    # Create graph
    g = gt.Graph(directed=directed)
    g.add_vertex(num_vertices)
    g.add_edge_list(e_df[['v1', 'v2']].values)

    # Add properties
    if 'weight' in e_df.columns:
        g.edge_properties["weight"] = g.new_edge_property("double", e_df['weight'].values)
    if v_df is not None:
        g.vertex_properties["label"] = g.new_vertex_property("string", v_df['label'].values)

    if remove_loops:
        gt.stats.remove_self_loops(g)

    return g

Here's what it returns for your example file:

In [1]: from pajek_to_gt import pajek_to_gt

In [2]: g = pajek_to_gt('pajek-example.NET')

In [3]: g.get_vertices()
Out[3]: array([0, 1, 2, 3, 4])

In [4]: g.vertex_properties['label'].get_2d_array([0])
Out[4]: array([['apple', 'cat', 'tree', 'nature', 'fire']], dtype='<U6')

In [5]: g.get_edges()
Out[5]:
array([[0, 2],
       [1, 3]])

In [6]: g.edge_properties['weight'].get_array()
Out[6]: PropertyArray([14.,  1.])

Note: This function does some preprocessing to convert double-spaces into single-spaces, since your example above uses double-spaces between entries. Was that intentional? The Pajek file specification you linked to uses single-spaces.
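
To illustrate that preprocessing step in isolation, here is a small sketch that uses the same two regular expressions as the function above. The helper name normalize_delimiters and the sample text are made up for this example. One caveat: leading or trailing whitespace on a line is reduced to a single space rather than removed.

```python
import re


def normalize_delimiters(text):
    """Collapse runs of spaces/tabs into one space and drop empty lines."""
    text = re.sub(r"[ \t]+", " ", text)  # tabs and repeated spaces -> one space
    text = re.sub(r"\n+", "\n", text)    # consecutive newlines -> one newline
    return text


# Hypothetical input mixing double spaces, tabs and a blank line.
raw = "*Vertices 2\n1  apple\n2\tcat\n\n*Edges\n1   2\t\t5\n"
normalized = normalize_delimiters(raw)
print(normalized)
```

After this step every field is separated by exactly one space, which is what the read_csv() calls with delimiter=' ' expect.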

Edit:

Upon re-reading the Pajek file spec you linked to, I notice that there are two possible formats for the edges section. The second format lists each node's neighbors, in a variable-length list:

*edgeslist
4941 386 395 451
1 3553 3586 3587 3637
2 3583
3 4930
4 88
5 13 120

Obviously, my implementation above is not compatible with that format. I've edited the function to raise an exception if that format is used in the file.
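
For anyone who does need that format, one possible approach is to expand each row into ordinary node pairs first. This is only a rough sketch and not part of the function above: the helper name neighbor_lists_to_pairs is hypothetical, and it assumes the first number on each row is the source node and the remaining numbers are its neighbors.

```python
def neighbor_lists_to_pairs(lines):
    """Expand '*edgeslist' rows into (source, target) pairs.

    Ids are left 1-based, as in the Pajek file.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = int(fields[0])
        pairs.extend((source, int(target)) for target in fields[1:])
    return pairs


# A few rows taken from the example above.
rows = ["4941 386 395 451", "2 3583", "5 13 120"]
pairs = neighbor_lists_to_pairs(rows)
print(pairs)
```

The pairs would still need to be shifted to 0-based indices before being passed to graph-tool, as is done for the pair-based format.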

Many thanks Stuart! I tested it with a network of about 43000 nodes and it seems that your solution is about 6 times faster than mine :) I was worried about loading the whole file into memory, in case someone has very big network files. — Forinstance, Oct 30 '20 at 11:33
I got an error with this line `gt.stats.remove_self_loops(g)`, so I changed it to `gts.remove_self_loops(g)`, with a previous import command `import graph_tool.stats as gts` — Forinstance, Oct 30 '20 at 11:37
As for the separator, I see files sometimes use single spaces, sometimes multiple spaces (2 or more) and other times tabs. But it seems to me that your function will work properly in all these cases? — Forinstance, Oct 30 '20 at 12:39
Perhaps the discrepancy w.r.t. `gt.stats` is due to the particular version of `graph-tool` being used. FWIW, I tested with `2.33`, installed from conda-forge. — Stuart Berg, Oct 30 '20 at 16:35
Yep, the preprocessing regex ought to handle multiple spaces or tab characters. — Stuart Berg, Oct 30 '20 at 16:37
Strange for the version, I was getting the error in Google Colab for which I think I use the latest version through these commands `!echo "deb http://downloads.skewed.de/apt bionic main" >> /etc/apt/sources.list !apt-key adv --keyserver keys.openpgp.org --recv-key 612DEFB798507F25 !apt-get update !apt-get install python3-graph-tool python3-cairo python3-matplotlib` — Forinstance, Oct 30 '20 at 16:53
I think the mistake is mine -- I forgot to include the import statements in the code! Sorry about that. (Now fixed.) As you can see, I used `import graph_tool as gt`, not `graph_tool.all`. Maybe that's the difference. — Stuart Berg, Oct 30 '20 at 16:55
Hi @StuartBerg, can I ask you if the same approach can be applied in case of partitions? https://stackoverflow.com/questions/74368436/reading-a-pajek-file-with-partitions?noredirect=1#comment131289791_74368436 thanks — LdM, Nov 09 '22 at 11:04

Forinstance · Answer 2 · 2020-10-29T15:44:46.037

This is the solution I developed today:

import graph_tool.all as gt
import graph_tool.stats as gts

def pajTOgt(filepath, directed = False, removeloops = True):
  g = gt.Graph(directed=directed)

  #define edge and vertex properties
  g.edge_properties["weight"] = g.new_edge_property("double")
  g.vertex_properties["id"] = g.new_vertex_property("string")

  with open(filepath, encoding = "utf-8") as input_data:
    #create vertices
    for line in input_data:
        g.add_vertex(int(line.replace("*Vertices ", "").strip())) #add vertices
        break

    #label vertices
    for line in input_data: #keeps going for node labels
      if line.strip() not in ('*Edges', '*Arcs'):  # stop at the start of the edge section
        v_id = int(line.split()[0]) - 1
        g.vertex_properties["id"][g.vertex(v_id)] = "".join(line.split()[1:])
      else:
        break

    #create weighted edges
    for line in input_data: #keeps going for edges
      linesplit = line.split()
      linesplit = [int(x) for x in linesplit[:2]] + [float(linesplit[2])]
      if linesplit[2] > 0:
        n1 = g.vertex(linesplit[0]-1)
        n2 = g.vertex(linesplit[1]-1)
        e = g.add_edge(n1, n2)
        g.edge_properties["weight"][e] = linesplit[2]

    if removeloops:
      gts.remove_self_loops(g)

    return g

Still, if you find something more efficient, I'd be curious to know.
