
I am having trouble with the speed of the search function that I wrote. The function steps are described below:

  1. The function begins with two table name parameters, a starting-point and a target
  2. The function then traverses a list of table-column combinations (50,000 long) and retrieves all the combinations associated with the starting-point table.
  3. The function then loops through each of the retrieved combinations and for each combination, it traverses the table-column combinations list once again, but this time looking for tables that match the given column.
  4. Finally, the function loops through each of the combinations retrieved in the previous step and, for each combination, checks whether the table is the same as the target table; if so, it saves it, and if not, it calls itself, passing in the table name from that combination.

The aim of the function is to trace a link between tables, whether the link is direct or has multiple degrees of separation. The recursion depth is capped at a fixed integer value.

My problem is that any time I run this function with two levels of search depth (I wouldn't dare try deeper at this stage), the job either runs out of memory or I lose patience. On one run I waited 17 minutes before the job ran out of memory.

The average number of columns per table is 28 and the standard deviation is 34.

Here is a diagram showing examples of the various links that can be made between tables:

Each column can have a match in multiple tables. Each matching table can then be searched column by column for further tables with matching columns, and so on.
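
For illustration, here is a hypothetical handful of rows that would produce links of that kind (the table and column names are invented):

// Invented sample data only, to illustrate direct and second-degree links:
// Orders.CustomerId matches Customers.CustomerId  (direct link)
// Customers.RegionId matches Regions.RegionId     (second degree: Orders -> Customers -> Regions)
var sourceList = new List<TableColumns>
{
    new TableColumns { Table = "Orders",    Column = "CustomerId" },
    new TableColumns { Table = "Customers", Column = "CustomerId" },
    new TableColumns { Table = "Customers", Column = "RegionId"   },
    new TableColumns { Table = "Regions",   Column = "RegionId"   },
};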

Here is my code:

private void FindLinkingTables(List<TableColumns> sourceList, TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        IEnumerable<string> tableColumns = sourceList.Where(x => x.Table.Equals(parentNode.Table)).Select(x => x.Column);

        foreach (string sourceColumn in tableColumns)
        {
            string shortName = sourceColumn.Substring(1);

            IEnumerable<TableSearchNode> tables = sourceList.Where(
                x => x.Column.Substring(1).Equals(shortName) && !x.Table.Equals(parentNode.Table) && !parentNode.Ancenstory.Contains(x.Table)).Select(
                    x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });
            foreach (TableSearchNode table in tables)
            {
                parentNode.AddChildNode(sourceColumn, table);
                if (!table.Table.Equals(targetTable))
                {
                    FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
                }
                else
                {
                    table.NotifySeachResult(true);
                }
            }
        }
    }
}

EDIT: separated out the TableSearchNode logic and added the property and method below for completeness.

//TableSearchNode
public Dictionary<string, List<TableSearchNode>> Children { get; private set; }

//TableSearchNode
public List<string> Ancenstory
{
    get
    {
        Stack<string> ancestory = new Stack<string>();
        TableSearchNode ancestor = ParentNode;
        while (ancestor != null)
        {
            ancestory.Push(ancestor.tbl);
            ancestor = ancestor.ParentNode;
        }
        return ancestory.ToList();
    }
}

//TableSearchNode
public void AddChildNode(string referenceColumn, TableSearchNode childNode)
{
    childNode.ParentNode = this;
    List<TableSearchNode> relatedTables = null;
    Children.TryGetValue(referenceColumn, out relatedTables);
    if (relatedTables == null)
    {
        relatedTables = new List<TableSearchNode>();
        Children.Add(referenceColumn, relatedTables);
    }
    relatedTables.Add(childNode);
}
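
(The Children dictionary has to be initialized before AddChildNode is used; the constructor is omitted above, but it presumably does something like the following minimal sketch.)

//TableSearchNode (constructor not shown above; minimal sketch only)
public TableSearchNode()
{
    // Children must be non-null before AddChildNode calls TryGetValue on it
    Children = new Dictionary<string, List<TableSearchNode>>();
}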

Thanks in advance for your help!

Sinker
  • This may help, if I remember correctly tail calls will not flood the stack - but I could be horribly wrong. http://blogs.msdn.com/b/clrcodegeneration/archive/2009/05/11/tail-call-improvements-in-net-framework-4.aspx (disclaimer - I am a business major) – Eric Scherrer Jun 13 '14 at 12:17
  • I'm looking at this in more detail, but one comment is that if performance is a big issue, you might want to consider removing all the LINQ calls. – nicholas Jun 13 '14 at 12:21
  • @nicholas Out of interest, where have you read that LINQ would necessarily be slower than an alternative? – Daniel Kelley Jun 13 '14 at 12:30
  • May it be that you get too many false positives when comparing only the first character? Btw, this comparison is faster when you use `.[0]` instead of `.Substring(1)`. Also, you might want to remove the `ToList()` calls. These seem unnecessary. – Nico Schertler Jun 13 '14 at 12:30
  • @EricScherrer thank you very much for the link. Unfortunately because each function can call itself multiple times in a loop, I found that I couldn't mimic the examples; *however*, I changed `if (!table.Table.Equals(targetTable))` to `if (table.Table.Equals(targetTable))` and swapped the inner statements, so that the recursive call is the last statement. I am not sure how that translates at execution. If you have a better way of doing it, let me know. Btw, I already went horribly wrong in all the possible ways (process ate 1.5gb of RAM), so don't worry :) – Sinker Jun 13 '14 at 12:30
  • Isn't this just a shortest-path search? Compute the set of tables reachable in one step. From that, compute the set of tables reachable in two steps. And so on. – usr Jun 13 '14 at 12:32
  • @DanielKelley, apart from the relatively slower performance of LINQ in general, one example is generating the `List tableColumns` in the first place. Iterating over `sourceList` with an internal `if` check would avoid traversing the list twice. That's just one example. – nicholas Jun 13 '14 at 12:33
  • @nicholas thank you for your suggestion. I moved the data from SQL into a List in memory first, before passing it into this method, instead of running the queries directly on the DbSet. I found that this was much faster, reducing searches to 10-25 milliseconds. The problem is that, as *Nico Schertler* alluded to, there are too many iterations. – Sinker Jun 13 '14 at 12:36
  • @nicholas "apart from the relatively slower performance of LINQ in general" - that is the part I am interested. Do you have a resource or link that backs that up? Something empirical that shows that LINQ is inherently slow than a simple `foreach` for example? – Daniel Kelley Jun 13 '14 at 12:38
  • @Sinker, this is really better handled by pre-hashing your list instead of iterating over it. I'm working up an answer that implements this. Are there any circular relationships? – nicholas Jun 13 '14 at 12:38
  • @NicoSchertler I changed the code as per your suggestion so that I use `IEnumerable` instead of `List` and that way, I wouldn't have to convert to a `List`. – Sinker Jun 13 '14 at 12:43
  • @nicholas there are circular relationships in the sense that TableA-Column1 is related to TableB-Column5 which is related to TableA-Column1, but the following two LINQ Where conditions filter out these self-references: `!x.Table.Equals(parentNode.Table)` and `!parentNode.Ancenstory.Contains(x.Table)`. – Sinker Jun 13 '14 at 12:47
  • @DanielKelley this is sort of what you are looking for. [LINQ does not come for free](http://stackoverflow.com/a/1185238/1193596) – Amicable Jun 13 '14 at 12:49
  • You're wasting your time optimizing the code in all those small ways. You'll get a constant speedup but even 10x will not help you. You need an entirely new algorithm that has lower asymptotic cost. – usr Jun 13 '14 at 13:05
  • @usr correct. In the end, I am ending up with (ListCount)^x-1 loops where x is the depth – Sinker Jun 13 '14 at 13:08
  • So what do you think of my suggestion? Maybe you didn't see the comment. – usr Jun 13 '14 at 13:09
  • @usr I had to have a second look to see your previous comment. Apologies. I am trying to remember my algorithms class from 5 years ago, so my terminology is rusty. What this function is trying to do is find *all* the paths to the target, with the restrictions being #1. you can't go over nodes that you have already been over (i.e. link to tables already linked to in the chain) and #2. the number of nodes in the chain must be <= the max specified. – Sinker Jun 13 '14 at 13:34
  • There are various elements from pretty much all the answers below that I want to incorporate in a second attempt. I will give it a go and update the post. Thanks for all the fantastic effort! I love this community! – Sinker Jun 13 '14 at 14:19
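
usr's suggestion above amounts to a breadth-first, level-by-level expansion. As a rough sketch only (the method and variable names here are invented; TableColumns is the same shape as in the question), computing which tables are reachable within a given number of steps could look like this:

private static HashSet<string> TablesReachableWithin(
    List<TableColumns> sourceList, string startTable, int maxDepth)
{
    // Group the 50,000 rows once, instead of re-scanning the full list at every step.
    // Substring(1) mirrors the column-matching convention used in the question's code.
    ILookup<string, string> tablesByColumn =
        sourceList.ToLookup(x => x.Column.Substring(1), x => x.Table);
    ILookup<string, string> columnsByTable =
        sourceList.ToLookup(x => x.Table, x => x.Column.Substring(1));

    var visited = new HashSet<string> { startTable };
    var frontier = new List<string> { startTable };

    for (int depth = 0; depth < maxDepth && frontier.Count > 0; depth++)
    {
        var next = new List<string>();
        foreach (string table in frontier)
        {
            foreach (string column in columnsByTable[table])
            {
                foreach (string linkedTable in tablesByColumn[column])
                {
                    if (visited.Add(linkedTable))   // Add returns false for tables already seen
                    {
                        next.Add(linkedTable);
                    }
                }
            }
        }
        frontier = next;   // the tables first reached at this depth
    }
    return visited;
}

This only answers reachability; enumerating every qualifying path, as the comments require, still needs the node tree, but grouping the rows up front means each level touches only the relevant groups rather than the whole 50,000-entry list.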

4 Answers


You are really wasting a lot of memory. Here is what immediately comes to mind:

  1. First of all, replace the incoming List<TableColumns> sourceList with an ILookup<string, TableColumns>. You should build this once, before calling FindLinkingTables:

    ILookup<string, TableColumns> sourceLookup = sourceList.ToLookup(s => s.Table);
    FindLinkingTables(sourceLookup, parentNode, targetTable, maxSearchDepth);
    
  2. Do not call .ToList() if you do not really need it. For example, if you are only going to enumerate the resulting sequence once, you do not need it. Your main function will then look like this:

    private void FindLinkingTables(ILookup<string, TableColumns> sourceLookup, TableSearchNode parentNode, string targetTable, int maxSearchDepth)
    {
        if (parentNode.Level < maxSearchDepth)
        {
            var tableColumns = sourceLookup[parentNode.Table].Select(x => x.Column);
    
            foreach (string sourceColumn in tableColumns)
            {
                string shortName = sourceColumn.Substring(1);
    
                var tables = sourceLookup
                    .Where(
                        group => !group.Key.Equals(parentNode.Table)
                                 && !parentNode.Ancenstory.Contains(group.Key))
                    .SelectMany(group => group)
                    .Where(tableColumn => tableColumn.Column.Substring(1).Equals(shortName))
                    .Select(
                        x => new TableSearchNode
                        {
                            Table = x.Table,
                            Column = x.Column,
                            Level = parentNode.Level + 1
                        });
    
                foreach (TableSearchNode table in tables)
                {
                    parentNode.AddChildNode(sourceColumn, table);
                    if (!table.Table.Equals(targetTable))
                    {
                        FindLinkingTables(sourceLookup, table, targetTable, maxSearchDepth);
                    }
                    else
                    {
                        table.NotifySeachResult(true);
                    }
                }
            }
        }
    }
    

    [Edit]

  3. Also, in order to speed up the remaining complex LINQ query, you can prepare yet another ILookup:

    ILookup<string, TableColumns> sourceColumnLookup = sourceList
            .ToLookup(t => t.Column.Substring(1));
    
    //...
    
    private void FindLinkingTables(
        ILookup<string, TableColumns> sourceLookup, 
        ILookup<string, TableColumns> sourceColumnLookup,
        TableSearchNode parentNode, string targetTable, int maxSearchDepth)
    {
        if (parentNode.Level >= maxSearchDepth) return;
    
        var tableColumns = sourceLookup[parentNode.Table].Select(x => x.Column);
    
        foreach (string sourceColumn in tableColumns)
        {
            string shortName = sourceColumn.Substring(1);
    
            var tables = sourceColumnLookup[shortName]
                .Where(tableColumn => !tableColumn.Table.Equals(parentNode.Table)
                                      && !parentNode.AncenstoryReversed.Contains(tableColumn.Table))
                .Select(
                    x => new TableSearchNode
                        {
                            Table = x.Table,
                            Column = x.Column,
                            Level = parentNode.Level + 1
                        });
    
    
            foreach (TableSearchNode table in tables)
            {
                parentNode.AddChildNode(sourceColumn, table);
                if (!table.Table.Equals(targetTable))
                {
                    FindLinkingTables(sourceLookup, sourceColumnLookup, table, targetTable, maxSearchDepth);
                }
                else
                {
                    table.NotifySeachResult(true);
                }
            }
        }
    }
    
  4. I've checked your Ancenstory property. If IEnumerable<string> is enough for your needs, check this implementation:

    public IEnumerable<string> AncenstoryEnum
    {
        get { return AncenstoryReversed.Reverse(); }
    }
    
    public IEnumerable<string> AncenstoryReversed
    {
        get
        {
            TableSearchNode ancestor = ParentNode;
            while (ancestor != null)
            {
                yield return ancestor.tbl;
                ancestor = ancestor.ParentNode;
            }
        }
    }
    
Woodman
  • I modified the code to use `IEnumerable` instead of `List`, removing the need for the `ToList()`. I will try `ILookup`. Thanks. – Sinker Jun 13 '14 at 12:49
  • Good idea to have two lookups. I will try that. – Sinker Jun 13 '14 at 13:02
  • Also check the other possible implementation for the `Ancenstory` property. `AncenstoryReversed` enumerates all parents up to the top without needing a temporary buffer. `AncenstoryEnum`, of course, does use additional storage (an `Array` created within the LINQ `Reverse` function). – Woodman Jun 13 '14 at 13:11
  • Never seen yield before. Thanks! – Sinker Jun 13 '14 at 13:52
  • Oh, I missed the fact that you are calling the Ancenstory property on each recursion level... OK, I've corrected the code listed under point 3 so that it reuses the AncenstoryReversed property. At least there will be no additional memory cost... But the overall logic looks suspicious; maybe you should consider alternative ways to compose the results (keeping in mind all the performance advice mentioned here). – Woodman Jun 13 '14 at 15:33

I've managed to refactor your FindLinkingTables code down to this:

private void FindLinkingTables(
    List<TableColumns> sourceList, TableSearchNode parentNode,
    string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        var sames = sourceList.Where(w => w.Table == parentNode.Table);

        var query =
            from x in sames
            join y in sames
                on x.Column.Substring(1) equals y.Column.Substring(1)
            where !parentNode.Ancenstory.Contains(y.Table)
            select new TableSearchNode
            {
                Table = x.Table,
                Column = x.Column,
                Level = parentNode.Level + 1
            };

        foreach (TableSearchNode z in query)
        {
            parentNode.AddChildNode(z.Column, z);
            if (z.Table != targetTable)
            {
                FindLinkingTables(sourceList, z, targetTable, maxSearchDepth);
            }
            else
            {
                z.NotifySeachResult(true);
            }
        }
    }
}

It appears to me that your logic in the where !parentNode.Ancenstory.Contains(y.Table) part of the query is flawed. I think you need to rethink your search operation here and see what you come up with.

Enigmativity
  • I added more code to show how the "Ancestory" is built into the Nodes. I will review it anyway to see if there are any issues. Debug is my friend! – Sinker Jun 13 '14 at 13:21

There are a few things that stand out to me looking at this source method:

  1. In your Where clause you make a call to parentNode.Ancenstory; this has logarithmic run time by itself, then you make a call to .Contains on the List<string> it returns, which is another logarithmic call (it's linear, but the list has a logarithmic number of elements). What you're doing here is checking for cycles in your graph. These costs can be made constant by adding a field to TableColumns.Table which stores information on how that Table has been processed by the algorithm (alternatively, you could use a Dictionary<Table, int>, to avoid adding a field to the object). Typically, in a DFS algorithm, this field is either White, Grey, or Black - White for unprocessed (you haven't seen that Table before), Grey for an ancestor of the Table currently being processed, and Black for when you're done processing that Table and all of its children. To update your code to do this, it'd look like:

    foreach (string sourceColumn in tableColumns)
    {
        string shortName = sourceColumn.Substring(1);
    
        IEnumerable<TableSearchNode> tables =
            sourceList.Where(x => x.Column.Substring(1).Equals(shortName) &&
                                  x.Color == White)
                      .Select(x => new TableSearchNode
                                       {
                                            Table = x.Table,
                                            Column = x.Column,
                                            Level = parentNode.Level + 1
                                        });
        foreach (TableSearchNode table in tables)
        {
            parentNode.AddChildNode(sourceColumn, table);
    
            table.Color = Grey;
    
            if (!table.Table.Equals(targetTable))
            {
                FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
            }
            else
            {
                table.NotifySeachResult(true);
            }
    
            table.Color = Black;
        }
    }
    
  2. As you mentioned above, you're running out of memory. The easiest fix is to remove the recursive call (which is acting as an implicit stack) and replace it with an explicit Stack data structure. This turns the recursion into a loop, which C# is better at optimizing.

    private void FindLinkingTables(List<TableColumns> sourceList, TableSearchNode root, string targetTable, int maxSearchDepth)
    {
        Stack<TableSearchNode> stack = new Stack<TableSearchNode>();
        TableSearchNode current;
    
        stack.Push(root);
    
        while (stack.Count > 0)
        {
            current = stack.Pop();

            // stack.Count is the number of pending nodes, not the search depth,
            // so the depth limit is checked on the node itself
            if (current.Level >= maxSearchDepth)
            {
                continue;
            }
    
            var tableColumns = sourceList.Where(x => x.Table.Equals(current.Table))
                                         .Select(x => x.Column);
    
            foreach (string sourceColumn in tableColumns)
            {
                string shortName = sourceColumn.Substring(1);
    
                IEnumerable<TableSearchNode> tables =
                    sourceList.Where(x => x.Column.Substring(1).Equals(shortName) &&
                                          x.Color == White)
                              .Select(x => new TableSearchNode
                                               {
                                                    Table = x.Table,
                                                    Column = x.Column,
                                                    Level = current.Level + 1
                                                });
                foreach (TableSearchNode table in tables)
                {
                    current.AddChildNode(sourceColumn, table);
    
                    if (!table.Table.Equals(targetTable))
                    {
                        table.Color = Grey;
                        stack.Push(table);
                    }
                    else
                    {
                        // you could go ahead and construct the ancestry list here using the stack
                        table.NotifySeachResult(true);
                        return;
                    }
                }
            }
    
            current.Color = Black;
    
        }
    }
    
  3. Finally, we don't know how costly Table.Equals is, but if the comparison is deep then that could be adding significant run time to your inner loop.
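
If Table is simply the table name as a string, which the rest of the posted code suggests, a minimal option (assuming plain string comparisons throughout) is to force an ordinal comparison:

    // Assumes Table and targetTable are plain strings, as the posted code suggests;
    // ordinal comparison skips culture-sensitive string handling in the hot path
    if (string.Equals(table.Table, targetTable, StringComparison.Ordinal))
    {
        table.NotifySeachResult(true);
    }
    else
    {
        FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
    }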

Zack Butcher
  • There's a lot to try and take in here. I looked up DFS because I hadn't learned it before. I want to explore that option further. I think I should start reusing nodes as well to save on memory. Let me see if I can go through your example. Thanks. – Sinker Jun 13 '14 at 13:50
  • @Sinker Wikipedia is always a great place to start for algorithms: http://en.wikipedia.org/wiki/Depth-first_search – Zack Butcher Jun 13 '14 at 13:51

Okay, here is an answer which basically abandons all the code you have posted.

First, you should take your List<TableColumns> and hash them into something that can be indexed without having to iterate over your entire list.

For this purpose, I have written a class called TableColumnIndexer:

class TableColumnIndexer
{
    Dictionary<string, HashSet<string>> tables = new Dictionary<string, HashSet<string>>();

    public void Add(string tableName, string columnName)
    {
        this.Add(new TableColumns { Table = tableName, Column = columnName });
    }

    public void Add(TableColumns tableColumns)
    {
        if(! tables.ContainsKey(tableColumns.Table))
        {
            tables.Add(tableColumns.Table, new HashSet<string>());
        }

        tables[tableColumns.Table].Add(tableColumns.Column);
    }

    // .... More code to follow

Now, once you inject all your Table / Column values into this indexing class, you can invoke a recursive method to retrieve the shortest ancestry link between two tables. The implementation here is somewhat sloppy, but it is written for clarity over performance at this point:

    // .... continuation of TableColumnIndexer class
    public List<string> GetShortestAncestry(string parentName, string targetName, int maxDepth)
    {
        return GetSortestAncestryR(parentName, targetName, maxDepth - 1, 0, new Dictionary<string,int>());
    }

    private List<string> GetSortestAncestryR(string currentName, string targetName, int maxDepth, int currentDepth, Dictionary<string, int> vistedTables)
    {
        // Check if we have visited this table before
        if (!vistedTables.ContainsKey(currentName))
            vistedTables.Add(currentName, currentDepth);

        // Make sure we have not visited this table at a shallower depth before
        if (vistedTables[currentName] < currentDepth)
            return null;
        else
            vistedTables[currentName] = currentDepth;


        if (currentDepth <= maxDepth)
        {
            List<string> result = new List<string>();

            // First check if the current table contains a reference to the target table
            if (tables[currentName].Contains(targetName))
            {
                result.Add(currentName);
                result.Add(targetName);
                return result;
            }
            // If not try to see if any of the children tables have the target table
            else
            {
                List<string> bestResult = null;
                int bestDepth = int.MaxValue;

                foreach (string childTable in tables[currentName])
                {
                    var tempResult = GetSortestAncestryR(childTable, targetName, maxDepth, currentDepth + 1, vistedTables);

                    // Keep only the shortest path found to the target table
                    if (tempResult != null && tempResult.Count < bestDepth)
                    {
                        bestDepth = tempResult.Count;
                        bestResult = tempResult;
                    }
                }

                // Take the best link we found and add it to the result list
                if (bestDepth < int.MaxValue && bestResult != null)
                {
                    result.Add(currentName);
                    result.AddRange(bestResult);
                    return result;
                }
                // If we did not find any result, return nothing
                else
                {
                    return null;
                }
            }
        }
        else
        {
            return null;
        }
    }
}

All of this code is just a (somewhat verbose) implementation of a shortest-path algorithm that copes with circular relationships and multiple paths between the source table and the target table. Note that if there are two routes with the same depth between two tables, the algorithm will select only one of them (and not necessarily predictably).
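
For reference, here is a small usage sketch of the class above (sourceList, startingTable and targetTable are stand-in names; the indexer and GetShortestAncestry are the members defined in this answer):

var indexer = new TableColumnIndexer();
foreach (TableColumns tc in sourceList)
{
    indexer.Add(tc);
}

// Search up to three degrees of separation; null means no link was found within that depth
List<string> path = indexer.GetShortestAncestry(startingTable, targetTable, 3);
Console.WriteLine(path == null ? "no link found" : string.Join(" -> ", path));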

nicholas