Using a screenshot of the linked crossword as an example, I assume that:
- the crossword grid is crisp, i.e. the horizontal and vertical grid lines are drawn at exact pixels with a constant dark colour and that there is no noise inside the grid cells,
- the crossword is black or another relatively dark colour ("black") on white or light grey ("white"),
- the clue numbers are written in the top left corner,
- the crossword is rectangular and regular.
You can then scan the image from top to bottom to find horizontal black lines of sufficient length. A line is a run of black pixels that starts with a black pixel and is terminated by a white pixel. A pixel of any other colour indicates that the run is not a grid line. (This is to weed out text and buttons.) Do the same for vertical lines.
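The scan could be sketched roughly as follows. This is only an illustration under my own assumptions: the image is a plain list of rows of grey values (0 is black, 255 is white), the names dark_runs and find_lines are made up, and the thresholds dark_max and light_min are guesses you would have to tune. It reads "other pixels" as "the pixel that ends the run is neither black nor white".

```python
def dark_runs(row, min_length, dark_max=96, light_min=160):
    """Return (start, end) pairs of dark runs in one row; end is exclusive.

    A run is kept only if it is at least min_length long and is
    terminated by a light pixel or by the image edge (assumption).
    """
    runs = []
    n = len(row)
    x = 0
    while x < n:
        if row[x] > dark_max:          # not dark: keep looking
            x += 1
            continue
        start = x
        while x < n and row[x] <= dark_max:
            x += 1
        # A run that ends in a mid-grey pixel is treated as "not a line".
        ended_cleanly = x == n or row[x] >= light_min
        if ended_cleanly and x - start >= min_length:
            runs.append((start, x))
    return runs


def find_lines(image, min_length):
    """Return (horizontal, vertical) line lists for a grey-value image.

    Horizontal entries are (y, x_start, x_end); vertical entries are
    (x, y_start, y_end). The end coordinates are exclusive.
    """
    horizontal = [(y, s, e)
                  for y, row in enumerate(image)
                  for s, e in dark_runs(row, min_length)]
    # zip(*image) transposes the image, so columns can be scanned as rows.
    vertical = [(x, s, e)
                for x, col in enumerate(zip(*image))
                for s, e in dark_runs(col, min_length)]
    return horizontal, vertical
```

Note that a grid line drawn two pixels thick would show up as two adjacent lines here, so this relies on the "crisp grid" assumption from the list above.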
Ideally, you now have the crossword lines. If your image is not cropped to the crossword, you might have false positives, such as the button borders. To find the crossword lines, sort them by length and look for the largest contiguous block of the same length. These should be your crossword lines unless you have some degenerate cases.
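Filtering out the false positives might look like the sketch below. It assumes each line is a (position, start, end) tuple with an exclusive end, which is my own convention, and the name keep_grid_lines is made up. Counting how often each length occurs gives the same result as sorting by length and taking the largest block of equal values.

```python
from collections import Counter


def keep_grid_lines(lines):
    """Keep only the lines that share the most frequent length.

    lines is a list of (position, start, end) tuples, end exclusive
    (a hypothetical layout chosen for this sketch).
    """
    if not lines:
        return []
    lengths = Counter(end - start for _, start, end in lines)
    # The most frequent length is the largest block in the sorted list.
    best_length = max(lengths, key=lengths.get)
    return sorted(line for line in lines
                  if line[2] - line[1] == best_length)
```

If the lines are not pixel-exact, you would probably want to compare lengths with a small tolerance instead of testing for equality.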
Now do a nested loop of horizontal and vertical lines, but skip the first line. Look two or three pixels to the northwest of the intersection of the lines. If the pixel is dark, that's a blank. If it is light, it's a cell. This heuristic seems to work well. I say dark and light here, because some crosswords use grey cells to save on ink when printing and some cells are highlighted in the screenshot.
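As a rough sketch, the nested loop could look like this. The assumptions are mine: the image is a list of rows of grey values (0 is black, 255 is white), ys and xs are the sorted pixel positions of the horizontal and vertical grid lines, and the cutoff of 128 between "dark" and "light" is a guess that may need adjusting for grey blanks or highlighted cells. The name classify_cells is made up.

```python
def classify_cells(image, ys, xs, offset=2, threshold=128):
    """Return the grid as a list of strings: '#' is a blank, '.' a cell.

    ys and xs are the sorted positions of the horizontal and vertical
    grid lines. The probe pixel lies `offset` pixels up and to the left
    of each line intersection, i.e. inside the cell to the northwest.
    """
    grid = []
    for y in ys[1:]:                   # skip the first horizontal line
        row = []
        for x in xs[1:]:               # skip the first vertical line
            probe = image[y - offset][x - offset]
            row.append("#" if probe < threshold else ".")
        grid.append("".join(row))
    return grid
```

Each intersection of the second and later lines is the bottom right corner of exactly one cell, which is why skipping the first line in both directions visits every cell once.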
If you end up with no blanks, you have a barred crossword. You can find the bars by checking whether one of the pixels to the left and right of a cell border is black.
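A possible sketch of the bar check, again under my own assumptions: the image is a list of rows of grey values, ys and xs hold exactly one pixel position per grid line, and a bar is simply a border that is thicker than one pixel. The name find_bars and the returned sets are made up for this example.

```python
def find_bars(image, ys, xs, dark_max=96):
    """Return (right_bars, bottom_bars) as sets of (row, col) cells.

    A cell is in right_bars if the border on its right is thickened,
    and in bottom_bars if the border below it is thickened. Only
    interior borders are checked, not the outer frame.
    """
    right_bars, bottom_bars = set(), set()
    for r in range(len(ys) - 1):
        mid_y = (ys[r] + ys[r + 1]) // 2      # halfway down the cell
        for c in range(len(xs) - 2):
            x = xs[c + 1]                     # border right of cell (r, c)
            if image[mid_y][x - 1] <= dark_max or image[mid_y][x + 1] <= dark_max:
                right_bars.add((r, c))
    for c in range(len(xs) - 1):
        mid_x = (xs[c] + xs[c + 1]) // 2      # halfway across the cell
        for r in range(len(ys) - 2):
            y = ys[r + 1]                     # border below cell (r, c)
            if image[y - 1][mid_x] <= dark_max or image[y + 1][mid_x] <= dark_max:
                bottom_bars.add((r, c))
    return right_bars, bottom_bars
```

The probe is taken at the middle of the cell edge so that the crossing grid lines at the corners are not mistaken for a bar.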
Lastly, a tip: If you want to use your algorithm to find the cells in a crossword generated with the Crossword Compiler, look at the source. You will find a link to a JavaScript file, /puzzles/sample/cryptic_demo/cryptic_demo_xml.js, which contains the crossword as an XML string and also gives you the clues as a bonus.
Older versions of the Crossword Compiler, such as the one used for the Independent Cryptic, hide their data in a file loaded from an applet. The format of that file is binary, but not too hard to read if you know the original data.