I am trying to extract text from flowcharts and decision trees. When I run text region detection on the image with its original boxes and shapes, the results are poor. Is there a way to remove these shapes while keeping the text?
- You can use a Hough line detector (for example, OpenCV's `HoughLinesP`) to detect all the straight lines, then fill those lines with the background color. – ZdaR Apr 24 '18 at 03:37
- I would probably use [shape detection](https://stackoverflow.com/a/11427501/6225741), then run OCR on each ROI? – Nayfe Apr 24 '18 at 07:35
- @Nayfe Some of the text is outside the boxes, so shape detection misses those regions. I will update the photo. – Bade Apr 24 '18 at 11:59
1 Answer
You could use `connectedComponentsWithStats()`. You will get a single connected component for the chart lines, so you can just remove that component from the image.

fireant
- Could you please elaborate a little? There is almost no documentation available on `connectedComponentsWithStats` for Python 3. If Python 3 is not your preferred language, could you write out the steps you envision for removing the rectangles from the above image? – Bade May 07 '18 at 20:13