
I have 100 nodes and 4950 edges. What is the fastest way to create a graph in Python (I am not planning to visualize or draw it) so that I can access node information, i.e. know what each entry in the 2D matrix means, such as node 1 being connected to node 3? (I also don't need to save it as a matrix.)

import gensim
import nltk
from gensim.models import word2vec
from nltk.corpus import stopwords
import logging
import re
import itertools
import glob
from collections import defaultdict
import networkx as nx


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

sentences = word2vec.Text8Corpus("/home/mona/mscoco/text8")
model = word2vec.Word2Vec(sentences, workers = 16)
#model.init_sims(replace = True)
model_name = "text8_data"
model.save(model_name)

# Use a set for fast membership tests, and avoid shadowing the imported `stopwords` module.
stop_words = set(stopwords.words('english'))

path = "/home/mona/mscoco/caption_files/*.txt"
files = glob.glob(path)
g = nx.Graph()


def read_words(filename):
        # Keep letters only, lowercase everything, and drop English stopwords.
        with open(filename) as f:
                text = f.read()
        words = re.sub("[^a-zA-Z]", ' ', text).lower().split()
        return [w for w in words if w not in stop_words]


# Tokenize each file once instead of once per pair.
words_by_file = {file: read_words(file) for file in files}

for file in files:
        g.add_node(file)

for file1, file2 in itertools.combinations(files, 2):
        # Word Mover's Distance between the two caption files, stored as the edge weight.
        distance = model.wmdistance(words_by_file[file1], words_by_file[file2])
        print('{0}: {1}: {2}'.format(file1, file2, distance))
        g.add_edge(file1, file2, weight=distance)


print(g.number_of_nodes())
print(g.number_of_edges())


nx.write_gml(g, "gensim.gml")

Please let me know if you have a better suggestion than my current code. I will eventually have something like 20 nodes and 190 edges. I am mostly looking for something whose output would be easy to process in another program such as MATLAB. I am not sure whether .gml files are easy to process in MATLAB.

Mona Jalal
  • With that density of edges it would make more sense to store the pairs of nodes that *aren't* connected. – John Coleman Oct 10 '16 at 22:23
  • This question is too open ended. How are you getting the list of nodes and edges? An adjacency matrix, list of nodes with connections to their neighbors, a list of pairs representing an edge are all valid options... – jaypb Oct 10 '16 at 22:29
  • Possible duplicate of [What is the most efficient graph data structure in Python?](http://stackoverflow.com/questions/1171/what-is-the-most-efficient-graph-data-structure-in-python) – jaypb Oct 10 '16 at 22:30
  • each node is a file name and the edge is the similarity score between them. @jaypb I have been thinking of defaultdict for creating an adjacency list but I am not sure if that's a good solution. I will have a complete graph. – Mona Jalal Oct 10 '16 at 22:38
  • @jaypb so nodes are like `/home/mona/mscoco/caption_files/test/captions_broccoli467145.txt` and `/home/mona/mscoco/caption_files/test/captions_orange137230.txt` and edge is like: 0.825396825397 – Mona Jalal Oct 10 '16 at 22:42
  • @JohnColeman: With that density of edges, there *are* no pairs of nodes that aren't connected. – user2357112 Oct 10 '16 at 22:59
  • You could use a dictionary keyed by pairs of nodes where the value stored is the similarity score. – John Coleman Oct 10 '16 at 23:36
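The dictionary-of-pairs idea from the last comment can be sketched roughly as follows. This is only an illustration: `pair_key`, `build_similarity_table`, `toy_score`, and the short file names are hypothetical stand-ins, not the asker's real files or the gensim distance call.

```python
from itertools import combinations


def pair_key(a, b):
    """Return an order-independent key for an undirected edge."""
    return frozenset((a, b))


def build_similarity_table(nodes, score):
    """Map every unordered pair of distinct nodes to score(a, b)."""
    return {pair_key(a, b): score(a, b) for a, b in combinations(nodes, 2)}


def toy_score(a, b):
    # Hypothetical placeholder for a real similarity or distance function.
    return abs(ord(a[0]) - ord(b[0])) / 10


# Hypothetical node names standing in for the caption file paths.
nodes = ["a.txt", "b.txt", "c.txt", "d.txt"]
table = build_similarity_table(nodes, toy_score)

# Lookup works regardless of the order in which the two nodes are given.
print(table[pair_key("b.txt", "a.txt")])
```

Using a `frozenset` as the key means each pair is stored once, so a complete graph on n nodes needs n(n-1)/2 entries.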

1 Answer

I think generating a GML file for the precise purpose of reusing it in MATLAB is probably fine. This question has some more information about that:

Convert GML file to adjacency matrix in matlab
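If converting GML on the MATLAB side turns out to be awkward, one possible alternative is to write the adjacency matrix itself as plain CSV from Python, which MATLAB can typically load with a function such as `readmatrix` or `csvread`. The sketch below uses only the standard library; `adjacency_rows`, `write_matrix_csv`, the file names, and the weights are all hypothetical and only illustrate the shape of the output.

```python
import csv
import io


def adjacency_rows(nodes, weights):
    """Build a dense symmetric matrix (a list of rows) from pair weights.

    `weights` maps (node_a, node_b) tuples to a number. Pairs that are
    missing, and the diagonal, are left at 0.0.
    """
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for (a, b), w in weights.items():
        i, j = index[a], index[b]
        matrix[i][j] = w
        matrix[j][i] = w
    return matrix


def write_matrix_csv(matrix, handle):
    """Write one matrix row per CSV line to an open text handle."""
    csv.writer(handle, lineterminator="\n").writerows(matrix)


# Hypothetical nodes and weights standing in for files and their distances.
nodes = ["a.txt", "b.txt", "c.txt"]
weights = {
    ("a.txt", "b.txt"): 0.5,
    ("a.txt", "c.txt"): 0.25,
    ("b.txt", "c.txt"): 0.75,
}
matrix = adjacency_rows(nodes, weights)

# An in-memory buffer is used here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
write_matrix_csv(matrix, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Row and column i of the matrix correspond to `nodes[i]`, so the node list would need to be saved alongside the CSV if the file names matter on the MATLAB side.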

jaypb
  • please have a look at the update question with code :) – Mona Jalal Oct 10 '16 at 23:04
  • Hopefully my new answer is helpful, and you learned an important lesson about giving enough information in your question. The question as it now stands is completely different than before :). – jaypb Oct 10 '16 at 23:08