
I have code that creates adjacencies between data from my text file. The file structure is pretty simple: it has only 2 columns which describe the connections between 2 nodes. For example:

ANALYTICAL_BALANCE BFG_DEPOSIT
CUSTOMER_DETAIL BALANCE
BFG_2056 FFD_15
BALANCE BFG_16
BFG_16 STAT_HIST
ANALYTICAL_BALANCE BFG_2056
CUSTOM_DATA AND_11
AND_11 DICT_DEAL
DICT_DEAL BFG_2056

Right now I load the data into a list:

data = [line.split() for line in open('data.txt')]

and get a list like this:

data = [
    ["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
    ["CUSTOMER_DETAIL","BALANCE"],
    ["BFG_2056", "FFD_15"],
    ["BALANCE","BFG_16"],
    ["BFG_16","STAT_HIST"],
    ["ANALYTICAL_BALANCE","BFG_2056"],
    ["CUSTOM_DATA","AND_11"],
    ["AND_11","DICT_DEAL"],
    ["DICT_DEAL","BFG_2056"]
]

Then I create the adjacency list:

def create_adj(edges):
    adj = {}   # or use defaultdict(list) to avoid `if` in the loop below
    for a, b in edges:
        if a not in adj:
            adj[a] = []
        if b not in adj:
            adj[b] = []
        adj[a].append(b)
    return adj
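
As the comment in `create_adj` suggests, the same thing can be written more compactly with `collections.defaultdict`; this is just an equivalent sketch, not a change in behavior:

from collections import defaultdict

def create_adj(edges):
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b]  # touch b so sink-only nodes also end up as keys
    return adj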

and generate all paths:

def all_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if neighbor not in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])

    for node in adj:
        yield from recur([node])

So for the example I gave earlier, the output looks like the following. I don't print the lists with length equal to 1.

adj = create_adj(data)

paths = all_paths(adj)

for i in paths:
    if len(i) > 1:
        print(i)
output:
['ANALYTICAL_BALANCE', 'BFG_DEPOSIT']
['ANALYTICAL_BALANCE', 'BFG_2056', 'FFD_15']
['CUSTOMER_DETAIL', 'BALANCE', 'BFG_16', 'STAT_HIST']
['BALANCE', 'BFG_16', 'STAT_HIST']
['BFG_2056', 'FFD_15']
['BFG_16', 'STAT_HIST']
['CUSTOM_DATA', 'AND_11', 'DICT_DEAL', 'BFG_2056', 'FFD_15']
['AND_11', 'DICT_DEAL', 'BFG_2056', 'FFD_15']
['DICT_DEAL', 'BFG_2056', 'FFD_15']

Everything is fine while the data set is small, but I have almost 13k rows of these connections in the txt file, and execution just takes too long. That's why I want to change all the operations on lists to pandas DataFrames. I don't know how, because I have no experience with pandas. How would you do it? Maybe you have a better idea of how I could implement this. I was also thinking about using networkx, but I don't know how I could implement my code with it. Any help would be greatly appreciated!

  • One of the issues you will come across is that the total number of paths will grow exponentially with the size of the graph. – jhylands Aug 24 '21 at 10:33
  • @jhylands I know this but I'm still hoping that changing lists to dataframes could speed up the process. Especially for bigger data. – neekitit Aug 24 '21 at 10:38
  • My understanding of this is that the slowest part will be the printing. That won't be sped up by using a different framework. If you are looking for a path within the list of paths then you could apply a heuristic at the level of path generation. As you seem to want *all* the paths printed then there isn't a quicker way of doing it. – jhylands Aug 24 '21 at 10:38
  • The other thing I can think of is keep all the information in memory, generate the printed output in a buffer and then print that buffer. So instead of `print(i)` do `acc += str(i)` then outside of the loop `print(acc)` – jhylands Aug 24 '21 at 10:40
  • Surely for such a large input you want the output to be stored in a file anyway rather than printed to the terminal? – jhylands Aug 24 '21 at 10:41
  • You might get a speedup using a defaultdict of sets rather than of lists. – jhylands Aug 24 '21 at 10:44
  • @jhylands printing is just to show an example. I filter the data later. The expected final output is a little different, because I also unfold the paths with lengths bigger than 2. So for this: ``` ['CUSTOMER_DETAIL', 'BALANCE', 'BFG_16', 'STAT_HIST'] ``` I unfold it into lists of two elements: ``` ['CUSTOMER_DETAIL', 'BALANCE'] ['CUSTOMER_DETAIL', 'BFG_16'] ['CUSTOMER_DETAIL', 'STAT_HIST'] ``` So we can look at CUSTOMER_DETAIL as a leaf connected to nodes, and I want all my connections to be like this. That's why I can't use a dict. And the final output is saved into Excel with openpyxl. – neekitit Aug 24 '21 at 10:44
  • If you are filtering later, you are better off not generating all the paths but finding the paths with the properties you are filtering for – jhylands Aug 24 '21 at 10:46
  • I hope that I'm explaining it well. If I could I would show the whole code, but I can't show the code from work. – neekitit Aug 24 '21 at 10:50
  • Are you looking for disjoint clusters? The sets of node which are connected? – jhylands Aug 24 '21 at 10:52
  • https://stackoverflow.com/questions/1348783/finding-all-disconnected-subgraphs-in-a-graph – jhylands Aug 24 '21 at 10:52
  • I added the visualization for the data that I gave earlier: We can assume that CUSTOM_DATA is a "leaf", because it doesn't have a predecessor, and that it feeds into AND_11, and then AND_11 feeds into DICT_DEAL and so on. I'm not interested in the information that AND_11 feeds DICT_DEAL. I only want the information that CUSTOM_DATA is a primary node and that AND_11 and DICT_DEAL and so on have the same base in this primary node. So I want connections like this in the end, but I don't know how to get them efficiently: CUSTOM_DATA AND_11, CUSTOM_DATA DICT_DEAL – neekitit Aug 24 '21 at 11:15

1 Answer

Working with pandas won't speed things up for you, or at least not significantly, as a pandas DataFrame is intended for tabular data and is not optimized for network-structured data.

Working with networkx will definitely simplify your code, but I'm afraid it also won't significantly speed things up, as this library is written in pure Python, just like your code.
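
For illustration, here is a minimal networkx sketch of the path enumeration, assuming the edge list `data` from the question. Note that `all_simple_paths` needs an explicit source and target, so this loops over node pairs rather than yielding only maximal paths like your code does:

import networkx as nx

G = nx.DiGraph(data)  # builds the directed graph straight from the edge list

for source in G.nodes:
    for target in G.nodes:
        if source == target:
            continue
        # all_simple_paths yields every loop-free path from source to target
        for path in nx.all_simple_paths(G, source, target):
            print(path)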

You will achieve better performance by using other, faster libraries such as igraph or the SGraph object from turicreate. Both take some getting used to, but both will certainly speed things up for you.
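
As a rough sketch of the igraph route (assuming python-igraph; `Graph.TupleList` stores the node labels in the `name` vertex attribute, and `get_all_simple_paths` returns paths as lists of vertex indices):

import igraph as ig

g = ig.Graph.TupleList(data, directed=True)

for v in range(g.vcount()):
    # with no explicit target, every reachable vertex is a target
    for path in g.get_all_simple_paths(v):
        if len(path) > 1:
            print([g.vs[i]["name"] for i in path])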

Another approach to speeding things up is, of course, finding a suitable, faster algorithm for your problem.

Now, to be more practical: what you are actually doing is generating all possible spanning trees of the graph. Maybe here you can find some good references for a suitable algorithm that will achieve what you want in fewer steps.

I also found this, which includes an answer that suggests code for finding all minimum spanning trees. You can see how it creates a networkx Graph object and uses the function. You can then assign all edges the same weight, and it will simply find all spanning trees in the graph.
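
For what it's worth, recent networkx versions also ship a `SpanningTreeIterator` that does something similar out of the box; a minimal sketch, assuming the graph is connected and treating it as undirected with equal edge weights:

import networkx as nx

G = nx.Graph(data)  # undirected view of the edge list

# iterates over spanning trees; with equal weights the order is arbitrary
for tree in nx.SpanningTreeIterator(G):
    print(sorted(tree.edges))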

  • That's what I was looking for. I couldn't find the right words to explain what I want to get, but "generating all possible spanning trees" fits perfectly. The only downside is that I need to create a lot of them and also filter them, but I will make another post for that and maybe someone will have the answer. Thanks! – neekitit Aug 24 '21 at 21:08
  • For 13k edges it is going to generate a lot, no doubt. You didn't mention anything about filtering, though; in that case you can use yield to create a generator from the suggested function and filter the trees as they are generated, instead of aggregating a huge list of all results and then filtering. – ronpi Aug 25 '21 at 02:08
  • I added whole question with whole idea behind it if you want to check it. https://stackoverflow.com/questions/68921115/find-all-paths-from-leafs-to-each-node-in-a-forest – neekitit Aug 25 '21 at 11:22