It appears that each section (Vertices/Edges) of the Pajek file can be interpreted as a space-delimited CSV file, which means you can parse it with pandas.read_csv()
. That function is faster than the line-by-line parsing you suggested in your pure-python answer.
Also, it's faster to initialize the edge list and property lists all at once (as numpy arrays) rather than setting each element individually in a python loop.
I think the following implementation ought to be somewhat close to optimal, but I haven't benchmarked it.
import re
from io import StringIO
import numpy as np
import pandas as pd
import graph_tool as gt
def pajek_to_gt(path, directed=False, remove_loops=False):
"""
Load a Pajek .NET file[1] as a graph_tool.Graph.
Supports files which specify their edges via node pairs.
Does not support files which specify their edges via the
'edgeslist' scheme (i.e. the neighbors-list scheme).
Note:
Vertices are renumbered to start with 0, per graph-tool
conventions (not Pajek conventions, which start with 1).
Author: Stuart Berg (github.com/stuarteberg)
License: MIT
[1]: https://gephi.org/users/supported-graph-formats/pajek-net-format/
"""
# Load into RAM
with open(path, 'r') as f:
full_text = f.read()
if '*edgeslist' in full_text:
raise RuntimeError("Neighbor list format not supported.")
# Erase comment lines
full_text = re.sub(r'^\s*%.*$', '', full_text, flags=re.MULTILINE)
# Erase blank lines (including those created by erasing comments)
full_text = re.sub(r'\n+', '\n', full_text)
# Ensure delimiter is a single space
full_text = re.sub(r'[ \t]+', ' ', full_text)
num_vertices = int(StringIO(full_text).readline().split()[-1])
# Split into vertex section and edges section
# (Vertex section might be empty)
vertex_text, edges_text = re.split(r'\*[^\n]+\n', full_text)[1:]
# Parse vertices (if present)
v_df = None
if vertex_text:
v_df = pd.read_csv(StringIO(vertex_text), delimiter=' ', engine='c', names=['id', 'label'], header=None)
assert (v_df['id'] == np.arange(1, 1+num_vertices)).all(), \
"File does not list all vertices, or lists them out of order."
# Parse edges
e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', engine='c', header=None)
if len(e_df.columns) == 2:
e_df.columns = ['v1', 'v2']
elif len(e_df.columns) == 3:
e_df.columns = ['v1', 'v2', 'weight']
else:
raise RuntimeError("Can't understand edge list")
e_df[['v1', 'v2']] -= 1
# Free up some RAM
del full_text, vertex_text, edges_text
# Create graph
g = gt.Graph(directed=directed)
g.add_vertex(num_vertices)
g.add_edge_list(e_df[['v1', 'v2']].values)
# Add properties
if 'weight' in e_df.columns:
g.edge_properties["weight"] = g.new_edge_property("double", e_df['weight'].values)
if v_df is not None:
g.vertex_properties["label"] = g.new_vertex_property("string", v_df['label'].values)
if remove_loops:
gt.stats.remove_self_loops(g)
return g
Here's what it returns for your example file:
In [1]: from pajek_to_gt import pajek_to_gt
In [2]: g = pajek_to_gt('pajek-example.NET')
In [3]: g.get_vertices()
Out[3]: array([0, 1, 2, 3, 4])
In [4]: g.vertex_properties['label'].get_2d_array([0])
Out[4]: array([['apple', 'cat', 'tree', 'nature', 'fire']], dtype='<U6')
In [5]: g.get_edges()
Out[5]:
array([[0, 2],
[1, 3]])
In [6]: g.edge_properties['weight'].get_array()
Out[6]: PropertyArray([14., 1.])
Note: This function does some preprocessing to convert double-spaces into single-spaces, since your example above uses double-spaces between entries. Was that intentional? The Pajek file specification you linked to uses single-spaces.
Edit:
Upon re-reading the Pajek file spec you linked to, I notice that there are two possible formats for the edges section. The second format lists each node's neighbors, in a variable-length list:
*edgeslist
4941 386 395 451
1 3553 3586 3587 3637
2 3583
3 4930
4 88
5 13 120
Obviously, my implementation above is not compatible with that format. I've edited the function to raise an exception if that format is used in the file.