[This is of interest to us.] I am assuming your input is effectively a bitmap - a rectangular matrix of pixels. The first question is whether it is aligned with the axes - if it has been scanned, it probably is not. You may need a deskewing algorithm (this is rather dated, but it's a useful start: http://www.eecs.berkeley.edu/~fateman/kathey/node11.html)
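A minimal sketch of one common deskewing idea, assuming a binarized numpy array `img` with foreground pixels set to 1; the angle range and step size are arbitrary choices for illustration, not a tuned recommendation.

    import numpy as np
    from scipy.ndimage import rotate

    def estimate_skew(img, max_angle=5.0, step=0.1):
        """Return the rotation angle (degrees) that maximizes the variance
        of the horizontal projection profile - it is sharpest when text
        rows line up with the image rows."""
        best_angle, best_score = 0.0, -1.0
        for angle in np.arange(-max_angle, max_angle + step, step):
            rotated = rotate(img, angle, reshape=False, order=0)
            profile = rotated.sum(axis=1)      # foreground pixels per row
            score = np.var(profile)            # peaky profile => well aligned
            if score > best_score:
                best_angle, best_score = angle, score
        return best_angle

    def deskew(img):
        return rotate(img, estimate_skew(img), reshape=False, order=0)

This brute-force search is slow but easy to reason about; for production you would refine the angle in two passes (coarse, then fine) or estimate it from detected lines instead.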
The classic approach to line detection is the Hough transform (http://en.wikipedia.org/wiki/Hough_transform), though our current collaborators do better than this for simple boxes by projecting pixels onto different viewpoints - similar to tomography. Rotate the image and count the density/histogram of points along the projection lines. For simple boxes that gives a clear signal.
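A minimal sketch of the projection idea, again assuming a binarized numpy array `img` (already deskewed, so boxes are axis-aligned); the threshold fraction is an illustrative placeholder.

    import numpy as np

    def box_edge_candidates(img, threshold_frac=0.5):
        """Return candidate row/column indices for box edges: rows or columns
        whose pixel density is a large fraction of the strongest peak."""
        row_profile = img.sum(axis=1)   # density along horizontal lines
        col_profile = img.sum(axis=0)   # density along vertical lines
        rows = np.where(row_profile > threshold_frac * row_profile.max())[0]
        cols = np.where(col_profile > threshold_frac * col_profile.max())[0]
        return rows, cols

For lines at arbitrary angles you would sweep the projection direction (rotate and re-project), which is essentially what the Hough transform does in accumulator form; OpenCV ships ready-made implementations if you would rather not roll your own.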
For the text I suspect you either have to have a set of likely fonts or to use machine learning. In the latter case you have to devise features and then select a series of images that humans have classified as text or not-text. Your algorithm (and there are many: neural nets, maximum entropy, etc.) is then trained against these.
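A minimal sketch of that training setup, assuming scikit-learn and a human-labelled set of equal-sized image patches (`patches`, a list of 2-D numpy arrays) with labels 1 = text, 0 = not-text; the features here are illustrative placeholders, not a recommendation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression  # ~ maximum entropy

    def features(patch):
        """Crude hand-devised features for a single patch."""
        return np.array([
            patch.mean(),               # overall ink density
            patch.sum(axis=0).std(),    # column-profile variation
            patch.sum(axis=1).std(),    # row-profile variation
        ])

    def train_text_classifier(patches, labels):
        X = np.array([features(p) for p in patches])
        return LogisticRegression().fit(X, labels)

You would then slide the classifier over candidate regions of a new page and keep the patches it labels as text; the real work is in choosing good features and getting enough labelled examples.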
The quality of the pixel map makes a great deal of difference. Documents from 20 years ago are much harder than bitmaps of documents created through drawing programs and dumped as PDF (of course, if you can interpret the text in the PDF, that helps a good deal).
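A minimal sketch of checking for an embedded text layer before falling back to pixel-level analysis, assuming the pdfminer.six package and a hypothetical file name; a PDF exported from a drawing program often carries its text intact.

    from pdfminer.high_level import extract_text

    text = extract_text("document.pdf")   # hypothetical input file
    if not text.strip():
        # no usable text layer - rasterize and fall back to image analysis
        print("No embedded text found; treating the PDF as a bitmap")
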