1

A simple line class is define as two PointF members containing the start and end coordinates:

public class Line    {
    PointF s, e;
}

I have two lists containing all horizontal and vertical lines that appear on a drawing canvas and form one or more tables.

List<Line> AllHorizontalLines;
List<Line> AllVerticalLines;

I need to group these lines so that lines belonging to one table are captured in a single group, thus the grouping function would have a signature like this:

List<List<Line>> GroupLines(List<Line> hor, List<Line> ver)
{ 

}

For simplicity we are assuming that there are only "simple" tables on the page, i.e. no nested table are there. However there can be merged cells, so we have to ignore small horizontal and vertical lines that are smaller than the full height of the parent table. For further simplicity, assume that both input lists are sorted (horizontal lines w.r.t. Y-axis and vertical lines w.r.t. X-axis).

Is there any known algorithm to solve this problem? Or can anyone help me devise one?

dotNET
  • 33,414
  • 24
  • 162
  • 251
  • Is this a homework question? If so you should probably tag it as such. – Joey Sep 14 '12 at 12:20
  • No it isn't. It is part of a larger and more complex problem I'm trying to solve. – dotNET Sep 14 '12 at 12:24
  • Ok, have you had a stab at the problem yet? If so, what did you come up with? – Joey Sep 14 '12 at 12:30
  • I'm working on it at the moment. Have given it a few tries trying to locate the "corners" (meeting points of left-most and right-most vertical lines with the horizontal lines), but it hasn't produced good results so far. – dotNET Sep 14 '12 at 12:44

3 Answers3

1

The following seems to work:

  • Set up a dictionary mapping bounding rectangles to a list of lines in each rectangle.
  • For each line in both your input lists (we don't care about the direction)
    • Create a bounding rectangle out of the line
    • Check if the line crosses any existing bounding rectangle(s).
      • If so, merge the lines from those rectangle(s), add the current line, delete the touched rectangles, calculate a new bounding rectangle, and check again.
      • Otherwise, add this new rectangle to the dictionary and delete the old one(s).
  • Return the list of lines from each rectangle.

Here's the code I've got:

public static IEnumerable<IEnumerable<Line>> GroupLines(IEnumerable<Line> lines)
{
    var grouped = new Dictionary<Rectangle, IEnumerable<Line>>();

    foreach (var line in lines)
    {
        var boundedLines = new List<Line>(new[] { line });
        IEnumerable<Rectangle> crossedRectangles;
        var boundingRectangle = CalculateRectangle(boundedLines);
        while (
            (crossedRectangles = grouped.Keys
                .Where(r => Crosses(boundingRectangle, r))
                .ToList()
            ).Any())
        {
            foreach (var crossedRectangle in crossedRectangles)
            {
                boundedLines.AddRange(grouped[crossedRectangle]);
                grouped.Remove(crossedRectangle);
            }
            boundingRectangle = CalculateRectangle(boundedLines);
        }
        grouped.Add(boundingRectangle, boundedLines);
    }
    return grouped.Values;
}

public static bool Crosses(Rectangle r1, Rectangle r2)
{
    return !(r2.Left > r1.Right ||
        r2.Right < r1.Left ||
        r2.Top > r1.Bottom ||
        r2.Bottom < r1.Top);
}

public static Rectangle CalculateRectangle(IEnumerable<Line> lines)
{
    Rectangle rtn = new Rectangle
    {
        Left = int.MaxValue,
        Right = int.MinValue,
        Top = int.MaxValue,
        Bottom = int.MinValue
    };

    foreach (var line in lines)
    {
        if (line.P1.X < rtn.Left) rtn.Left = line.P1.X;
        if (line.P2.X < rtn.Left) rtn.Left = line.P2.X;
        if (line.P1.X > rtn.Right) rtn.Right = line.P1.X;
        if (line.P2.X > rtn.Right) rtn.Right = line.P2.X;
        if (line.P1.Y < rtn.Top) rtn.Top = line.P1.Y;
        if (line.P2.Y < rtn.Top) rtn.Top = line.P2.Y;
        if (line.P1.Y > rtn.Bottom) rtn.Bottom = line.P1.Y;
        if (line.P2.Y > rtn.Bottom) rtn.Bottom = line.P2.Y;
    }

    return rtn;
}

WinForms app I knocked together

Rawling
  • 49,248
  • 7
  • 89
  • 127
  • Interesting, but I see some ambiguities in the pseudocode, or maybe I haven't understood it fully. Can you do a favour and translate the algo into C#? I can even translate from Java, C++ or VB.NET if you're not familiar with C#. – dotNET Sep 14 '12 at 12:42
  • @dotNET Done. I should stress I _think_ this works so I'd be happy for your comments. – Rawling Sep 14 '12 at 12:47
  • Actually I can see a hole in this already :( If you have e.g. two full-width rectangles top and bottom, and two half-width left and right ones in between, and a vertical line touches both top and bottom without touching left and right, it _won't_ merge all four, only the top and bottom. Damn. – Rawling Sep 14 '12 at 12:53
  • New algorithm, should cover this. Basically it repeatedly checks rectangle overlaps until there are none. As a plus it should even work for diagonal lines. – Rawling Sep 14 '12 at 13:06
  • Can you plz write code for CalculateRectangle(IEnumerable ls) as well? I'm having a hard time trying to grasp it. – dotNET Sep 14 '12 at 19:13
  • e.g. [this question/answer](http://stackoverflow.com/questions/2752349/fast-rectangle-to-rectangle-intersection) has a simple check. – Rawling Sep 15 '12 at 10:07
  • Thanks Rawling. The only issue I see is the Rectangle class. Are you using System.Drawing.Rectangle? If yes, that class doesn't allow you to set Right and Bottom properties. Those are read-only. I was a bit confused to see that your code actually produced output on your end. – dotNET Sep 15 '12 at 13:19
  • I used a custom `Rectangle` with the t/b/l/r properties, because they made the code simpler than using width and height. – Rawling Sep 15 '12 at 14:20
  • Rawling, I might be reading the question incorrectly but I think the OP wants to group the lines that make up a table (and hence are touching) rather than creating a bounding box around groups of lines. So in the bottom left of your picture I think it should be split into 3 groups, not one. – Joey Sep 15 '12 at 16:11
  • Joey is right. But it looks like Rawling's approach is a superset of what I require. I just finished running his algorithm on my lines and it seemingly produces the same (correct) output as the algoritm that I have devised myself (see below). I also did some benchmarks and for 1000 datasets, Rowling's algorithm takes around 830ms on the average, whereas my algorithm takes around 203ms. Let's see how your algorithm stands, once you update your code. – dotNET Sep 15 '12 at 16:27
  • @Joey I assumed tables would be rectangular, so bounding them with rectangles would be reasonable. But if a touching-lines algo is more efficient anyway, then fair enough :) I just enjoyed the question and figuring something out. – Rawling Sep 17 '12 at 07:10
1

Here is my suggested approach:

  • Clean both lists so there aren't any fully contained 'small' lines.
  • Pick any line.
  • Take all the lines that touch (intersect) this line.
  • For each of these lines take all the lines that touch them.
  • Continue until you can't find any more touching lines.
  • You now have a group.
  • Pick a line from those that remain and repeat until there are no more lines left.

Code:

public static IEnumerable<IEnumerable<Line>> Group(IEnumerable<Line> horizontalLines, IEnumerable<Line> verticalLines)
{
  // Clean the input lists here. I'll leave the implementation up to you.

  var ungroupedLines = new HashSet<Line>(horizontalLines.Concat(verticalLines));
  var groupedLines = new List<List<Line>>();

  while (ungroupedLines.Count > 0)
  {
    var group = new List<Line>();
    var unprocessedLines = new HashSet<Line>();
    unprocessedLines.Add(ungroupedLines.TakeFirst());

    while (unprocessedLines.Count > 0)
    {
      var line = unprocessedLines.TakeFirst();
      group.Add(line);
      unprocessedLines.AddRange(ungroupedLines.TakeIntersectingLines(line));
    }

    groupedLines.Add(group);
  }

  return groupedLines;
}

public static class GroupingExtensions
{
  public static T TakeFirst<T>(this HashSet<T> set)
  {
    var item = set.First();
    set.Remove(item);
    return item;
  }

  public static IEnumerable<Line> TakeIntersectingLines(this HashSet<Line> lines, Line line)
  {
    var intersectedLines = lines.Where(l => l.Intersects(line)).ToList();
    lines.RemoveRange(intersectedLines);
    return intersectedLines;
  }

  public static void RemoveRange<T>(this HashSet<T> set, IEnumerable<T> itemsToRemove)
  {
    foreach(var item in itemsToRemove)
    {
      set.Remove(item);
    }
  }

  public static void AddRange<T>(this HashSet<T> set, IEnumerable<T> itemsToAdd)
  {
    foreach(var item in itemsToAdd)
    {
      set.Add(item);
    }
  }
}

Add this method to Line

public bool Intersects(Line other)
{
  // Whether this line intersects the other line or not.
  // I'll leave the implementation up to you.
}

Notes:

If this code runs too slowly you might need to scan horizontally, picking up connected lines as you go. Might also be worth looking at this.

Specialised:

public static IEnumerable<IEnumerable<Line>> Group(IEnumerable<Line> horizontalLines, IEnumerable<Line> verticalLines)
{
  // Clean the input lists here. I'll leave the implementation up to you.

  var ungroupedHorizontalLines = new HashSet<Line>(horizontalLines);
  var ungroupedVerticalLines = new HashSet<Line>(verticalLines);
  var groupedLines = new List<List<Line>>();

  while (ungroupedHorizontalLines.Count + ungroupedVerticalLines.Count > 0)
  {
    var group = new List<Line>();
    var unprocessedHorizontalLines = new HashSet<Line>();
    var unprocessedVerticalLines = new HashSet<Line>();

    if (ungroupedHorizontalLines.Count > 0)
    {
      unprocessedHorizontalLines.Add(ungroupedHorizontalLines.TakeFirst());
    }
    else
    {
      unprocessedVerticalLines.Add(ungroupedVerticalLines.TakeFirst());
    }

    while (unprocessedHorizontalLines.Count + unprocessedVerticalLines.Count > 0)
    {
      while (unprocessedHorizontalLines.Count > 0)
      {
        var line = unprocessedHorizontalLines.TakeFirst();
        group.Add(line);
                unprocessedVerticalLines.AddRange(ungroupedVerticalLines.TakeIntersectingLines(line));
      }
      while (unprocessedVerticalLines.Count > 0)
      {
        var line = unprocessedVerticalLines.TakeFirst();
        group.Add(line);
        unprocessedHorizontalLines.AddRange(ungroupedHorizontalLines.TakeIntersectingLines(line));
      }
    }
    groupedLines.Add(group);
  }

  return groupedLines;
}

This assumes no lines overlap as it doesn't check if horizontal lines touch other horizontal lines (same for vertical).

You can probably remove the if-else. That's just there in case there are vertical lines not attached to horizontal lines.

Joey
  • 1,752
  • 13
  • 17
  • Yes, please translate that into C# (or Java or C++ or VB.NET), so I know what exactly are you proposing? – dotNET Sep 14 '12 at 19:16
  • If you need me to provide implementations for the bits I've left out I can but they shouldn't be too hard. – Joey Sep 14 '12 at 23:16
  • The above code contains several errors. Even worse, after correcting all of those, the code gets stuck in an (seemingly) infinite loop. For an input of 45 horizontal and 19 vertical lines, it had added 2021707 lines to the group variable and was going on when I interrupted the execution. – dotNET Sep 15 '12 at 07:53
  • Sorry for the errors, I wrote it using notepad. I'll download and editor and fix it up. Sounds like the lines aren't being removed from the sets, I'll look into it. – Joey Sep 15 '12 at 16:07
  • Fixed now. I tried it on some trivial lines and it worked for me. – Joey Sep 15 '12 at 16:37
  • Great work Joey. It works as fine as Rawling's, and in far lesser time. Average time for your algorithm appears to be around 345ms, somewhat higher than my algorithm. – dotNET Sep 15 '12 at 17:30
  • I'll write a specialised version of the algorithm tomorrow when I have access to VS again. It should hopefully halve the time. – Joey Sep 16 '12 at 08:33
  • Great. The specialized version runs around 0.5% faster than mine, so the performance is roughly equal and results are as good as the other 3 algorithms. I'll mark this as the correct answer. However, there is one downside of this algorithm compared to mine. This algorithm mixes up horizontal and vertical lines in a single collection, so you'll need to spend some more processing time separating them. – dotNET Sep 17 '12 at 17:30
0

OK. I finally devised an efficient algorithm myself. Here are the steps for any future reader.

  1. Define a grand collection of Tuple<List<Line>, List<Line>>. Each tuple in this collection will represent one table (all horizontal and vertical lines of that table).

  2. Find all vertical lines that touch a horizontal line's starting point at one end and another horizontal line's starting point at the other end. These will be the left-most lines of each table.

  3. Now for each left-line (call it LV): a. locate all the vertical lines that have same starting and ending Y as LV and thus belong to the same table. Add these "sister" lines to a list called MyVSisters.

    b. Locate the right-most vertical sister line's X-coordinate (call it RVX).

    c. Now locate all the horizontal lines that stretch from the LV's X to RVX and have their Y between LV's Y1 and Y2. Add these lines to a collection called MyHSisters.

    d. Add the tuple MyHSisters and MyVSisters to the grand collection.

  4. Return grand collection.

I'm not marking it as answer yet. I'll wait for response from both of the guys and compare performance and accuracy of all 3 before deciding which one is the best answer.

dotNET
  • 33,414
  • 24
  • 162
  • 251
  • OK. After looking into all the 3 algorithm provided on this page, my algorithm appears to be consuming the least time, whereas the accuracy of all algorithms appears to be equally good. So I'll mark this answer as correct. – dotNET Sep 15 '12 at 17:33
  • What happens if you have two identical tables side-by-side? Does step 3a work correctly or does it take too many lines? Also, does 3c work when you have two identical tables, one above the other or does it take too many lines? – Joey Sep 16 '12 at 08:31
  • Thanks Joey. Two identical table can appear one above the other and my code takes care of that. I have now modified step 3c to match my actual code. Side-by-side tables are however not possible in my input. – dotNET Sep 16 '12 at 14:33