
Requesting the mighty help of Stack Overflow. I'm working on an app that has to analyze documents via OCR (I'm using Tesseract) and extract all the text I can get out of them. Here is an example of the type of image:

Image including text to extract

Here is the preprocessing I do to get rid of all the lines. In the future I will probably also have to analyze each "rectangle" separately (feeding a zone delimited by the detected lines to Tesseract), so I guess there are simpler methods than this, but they wouldn't give me the line coordinates.

package formRecog;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.core.Scalar;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import static org.opencv.core.Core.bitwise_not;
import org.opencv.core.MatOfPoint;


public class testMat {

    public static void main(String[] args) {

        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        Mat source  = Imgcodecs.imread("./image.png",Imgcodecs.CV_LOAD_IMAGE_ANYCOLOR);
        Mat destination  = new Mat(source.rows(), source.cols(), source.type());
        Imgproc.cvtColor(source, destination, Imgproc.COLOR_RGB2GRAY);  
        Imgcodecs.imwrite("gray.jpg", destination);

        Imgproc.GaussianBlur(destination, destination, new Size(3, 3), 0, 0, Core.BORDER_DEFAULT);  

        Imgproc.Canny(destination, destination, 30, 90);
        Imgcodecs.imwrite("postcanny.jpg", destination);

        Mat houghlines = new Mat(); 
        Imgproc.HoughLinesP(destination, houghlines, 1, Math.PI / 180,  250, 185,5);

        // Draw the detected lines
        Mat result = new Mat(source.rows(), source.cols(), source.type());
        for (int i = 0; i < houghlines.rows(); i++) {
            double[] val = houghlines.get(i, 0);
            Imgproc.line(destination, new Point(val[0], val[1]), new Point(val[2], val[3]), new Scalar(0, 0, 255), 5);
            Imgproc.line(result, new Point(val[0], val[1]), new Point(val[2], val[3]), new Scalar(0, 0, 255),5);
        }

        Imgcodecs.imwrite("lines.jpg", result);

        Mat contourImg = new Mat(source.rows(), source.cols(), source.type());
        List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
        Mat hierarchy = new Mat();
        //Point offset = new Point();

        Imgproc.findContours(destination, contours, hierarchy, Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_NONE );
        Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1);

        Imgcodecs.imwrite("contour.jpg", contourImg);

        bitwise_not(destination,destination);


        Imgcodecs.imwrite("final.jpg", destination);

    }
}

Here is the final image:

Final image after processing

The problem is that Tesseract can't read anything in this image:

11m ËEZË@ÜDS@ 7 C@mpû@ 515 îf@5@??ûäû ©©m@@@ @@ vësw??a? PF©@MÜGS @"@X@Ü©ÜÎÊQÜ©IÏÙ 1111 175515

This is the first "line" I get.

I think it is because the letters are no longer "filled", so Tesseract cannot read them. Tesseract gave me pretty good results previously, but my line-removal method wasn't good. I'd like to fill the letters with black, but

Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1);

doesn't do anything, although I'm pretty sure findContours worked fine, because if I imwrite its result I get the very same image as before.

I searched for similar problems, like "cv2.drawContours will not draw filled contour" and "Contour shows dots rather than a curve when retrieving it from the list, but shows the curve otherwise", but didn't find anything I could use (or maybe I didn't understand it).

Just so you know, I started programming courses in September, so I'm pretty new to this (forgive me if there are some gruesome things written here), but I don't have a choice on the subject I'm working on :)

I hope I made myself clear enough and that my English isn't too bad.

My thanks.

EDIT: Thanks to Rick M., it's getting better: using CHAIN_APPROX_SIMPLE in findContours and iterating over each contour index (ldx) in drawContours did the trick. New final

Is there a way to improve this result? I'm guessing Tesseract won't be able to read this either? Thanks.

Uploading the post-Canny image: Image after Canny

DSt
  • Have you tried to draw the contours using contourIdx instead of -1? – Rick M. May 18 '18 at 08:39
  • Do you mean iterating over contourIdx to draw each contour separately? I just tried for (int ldx = 0; ldx < contours.size(); ++ldx) Imgproc.drawContours(contourImg, contours, ldx, new Scalar(255, 0, 0),-1); with no luck, but maybe I didn't get what you mean. – DSt May 18 '18 at 08:55
  • Yes I meant this. What does contours.size() say? – Rick M. May 18 '18 at 08:57
  • System.out.println(contours.size()); prints: 5369 – DSt May 18 '18 at 08:58
  • and contourImg.type()? – Rick M. May 18 '18 at 09:01
  • According to https://stackoverflow.com/questions/10167534/how-to-find-out-what-type-of-a-mat-object-is-with-mattype-in-opencv, this is a CV_8UC3 Mat – DSt May 18 '18 at 09:11
  • Try using `CHAIN_APPROX_SIMPLE` instead. – Rick M. May 18 '18 at 09:15
  • To improve your result, I would recommend applying an unsharp mask or a median filter before Canny. Additionally, you can try CLAHE. Can you also upload the image after Canny? I could do this myself, but unfortunately I don't have access to OpenCV at the moment. PS: Technically, the previous comment solved the question. Improving your result might require some work from you and another question. – Rick M. May 18 '18 at 09:39
  • Acknowledged, I will look into the filter and CLAHE. Thank you for your help and time. I'll mark the thread as answered, if I find out how :) – DSt May 18 '18 at 09:44
  • I've added it as an answer – Rick M. May 18 '18 at 09:50

1 Answer


The reason drawContours wasn't working as required is the flag: CHAIN_APPROX_NONE stores absolutely all the contour points. Using CHAIN_APPROX_SIMPLE instead, which compresses horizontal, vertical, and diagonal segments and leaves only their end points, gives you finished contours. In this case you could also call Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1); without the loop, and it should work fine.
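
To make "compresses segments and leaves only their end points" concrete, here is a minimal pure-Java sketch of that idea. It is not OpenCV's implementation: the class and method names (ContourCompressionSketch, compress, rectangleOutline) are made up for illustration, and it assumes a closed contour whose consecutive points are neighbouring pixels, which is how CHAIN_APPROX_NONE stores them.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics the idea behind CHAIN_APPROX_SIMPLE, not OpenCV's code. */
final class ContourCompressionSketch {

    /**
     * Keeps only the points where the direction of travel changes.
     * A point in the middle of a straight run (same step in as out) is dropped.
     * Assumes a closed contour with neighbouring pixels as consecutive points.
     */
    static List<int[]> compress(List<int[]> contour) {
        int n = contour.size();
        List<int[]> kept = new ArrayList<>();
        if (n < 3) {
            kept.addAll(contour);
            return kept;
        }
        for (int i = 0; i < n; i++) {
            int[] prev = contour.get((i + n - 1) % n);
            int[] cur = contour.get(i);
            int[] next = contour.get((i + 1) % n);
            int inDx = cur[0] - prev[0], inDy = cur[1] - prev[1];
            int outDx = next[0] - cur[0], outDy = next[1] - cur[1];
            if (inDx != outDx || inDy != outDy) {
                kept.add(cur); // direction changes here, so this is a segment end point
            }
        }
        return kept;
    }

    /** Every border pixel of an axis-aligned rectangle (x0 < x1, y0 < y1), starting at (x0, y0). */
    static List<int[]> rectangleOutline(int x0, int y0, int x1, int y1) {
        List<int[]> pts = new ArrayList<>();
        for (int x = x0; x < x1; x++) pts.add(new int[] {x, y0}); // top edge
        for (int y = y0; y < y1; y++) pts.add(new int[] {x1, y}); // right edge
        for (int x = x1; x > x0; x--) pts.add(new int[] {x, y1}); // bottom edge
        for (int y = y1; y > y0; y--) pts.add(new int[] {x0, y}); // left edge
        return pts;
    }

    public static void main(String[] args) {
        List<int[]> full = rectangleOutline(0, 0, 10, 5);
        List<int[]> simple = compress(full);
        System.out.println(full.size() + " points stored in full, "
                + simple.size() + " after compression");
    }
}
```

For the 10 x 5 rectangle in main, the 30 stored outline points collapse to the 4 corners.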

Now, regarding the discussion in the comments: the Canny image looks nice but, as you can see after zooming in, the letters that aren't detected by findContours are not completely connected. I would suggest using erosion with a small kernel (you will have to play with the parameters) to get better results.
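
The effect of such gaps, and of bridging them with a small kernel, can be sketched without OpenCV. The following is a toy illustration on a boolean grid, not OpenCV code; the names (GapClosingSketch, morph, close, enclosedPixels, brokenSquare) are hypothetical. In this sketch true marks a stroke pixel, so thickening the strokes is a dilation; on an inverted image (dark strokes on a light background) the same thickening is obtained by erosion. In OpenCV the corresponding calls would be Imgproc.dilate, Imgproc.erode, or Imgproc.morphologyEx with Imgproc.MORPH_CLOSE.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch on a boolean grid (true = stroke pixel); not OpenCV code. */
final class GapClosingSketch {

    /** 3x3 dilation (grow = true) or erosion (grow = false); cells outside the grid count as background. */
    static boolean[][] morph(boolean[][] img, boolean grow) {
        int h = img.length, w = img[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean any = false, all = true;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int ny = y + dy, nx = x + dx;
                        boolean v = ny >= 0 && ny < h && nx >= 0 && nx < w && img[ny][nx];
                        any |= v;
                        all &= v;
                    }
                }
                out[y][x] = grow ? any : all;
            }
        }
        return out;
    }

    /** Morphological closing: dilation followed by erosion, which bridges small gaps. */
    static boolean[][] close(boolean[][] img) {
        return morph(morph(img, true), false);
    }

    /** Counts background pixels that a 4-connected flood fill from the grid border cannot reach. */
    static int enclosedPixels(boolean[][] img) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        Deque<int[]> stack = new ArrayDeque<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean onBorder = y == 0 || x == 0 || y == h - 1 || x == w - 1;
                if (onBorder && !img[y][x]) {
                    seen[y][x] = true;
                    stack.push(new int[] {y, x});
                }
            }
        }
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            for (int[] s : steps) {
                int ny = p[0] + s[0], nx = p[1] + s[1];
                if (ny < 0 || ny >= h || nx < 0 || nx >= w || img[ny][nx] || seen[ny][nx]) continue;
                seen[ny][nx] = true;
                stack.push(new int[] {ny, nx});
            }
        }
        int enclosed = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!img[y][x] && !seen[y][x]) enclosed++;
            }
        }
        return enclosed;
    }

    /** A 7x7 square outline inside an 11x11 grid, with one pixel missing from its top edge. */
    static boolean[][] brokenSquare() {
        boolean[][] img = new boolean[11][11];
        for (int i = 2; i <= 8; i++) {
            img[2][i] = img[8][i] = img[i][2] = img[i][8] = true;
        }
        img[2][5] = false; // the gap, standing in for a broken Canny edge
        return img;
    }

    public static void main(String[] args) {
        boolean[][] img = brokenSquare();
        System.out.println("enclosed before closing: " + enclosedPixels(img));
        System.out.println("enclosed after closing:  " + enclosedPixels(close(img)));
    }
}
```

With the one-pixel gap, the fill from outside leaks in and no interior is enclosed (0 pixels); after closing, the outline is complete again and the 25 interior pixels can be filled. On a real scan the kernel size would need tuning, since too large a kernel also merges neighbouring letters.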

Rick M.