
[a snippet of a War Thunder stats screen]

I'm trying to pull just one line out of this, so I was hoping that when I ran pytesseract I'd get usable data out.

Instead, I'm getting strings like 'Ce eet il ae oe a on) os\n\nooo eo oo oo oom om om om)\n\n[OO COCO ORR OW OR PRP ODWWG\n\neyo fe) Fee ote) = = - = = eo me-e-)\n\n(Ss: oo ~7~oO 0 0\n\neB\n\n© te O fa ©\n\nOORFONONWR OW DFW NN\n\nVaso\nVES -5)\n1866\nnny\n1625\n1368\nLt\n1070\n898\n838\nwhey)\nom\na\nRie)\n15\n\nny,\n\n=ARAM= gksvlrwOlf\nDarth_Zipzap\naE a\njohnny478423\n=CNAPG _920831993\nOLOLUCTIIN AG\nRivDecartes\nfleadog406\nFormula13\n\nxL LongDubber\nDankdudledan\n_Trix_1740\n\nLUT engl)\n\n=MOPB= JP_Akatonbo\nPlutoh71689\nMakinHerSquirt\n\x0c'

I tried grey-scaling it, to no avail. I thought that, given the fairly discrete columns here, I'd be able to just split the string on spaces and newlines, but... no.

Any pointers in the right direction would be appreciated.

In previous experiments I had a little trouble because of images like the little controller icons, but I was able to detect and mask those before passing the image to tesseract. In this image, though, tesseract is failing to identify the numbers in the columns pretty consistently.

qkslvrwolf
  • Have you tried cropping the image down to just one line and then running tesseract on that? Try to get the y-interval between lines, then go through each line separately. – MichaelT572 Feb 28 '21 at 22:08
  • The one line will change. I could probably do that if I could identify which line has the username I'm looking for, but I'm not sure how to do that either. I suppose I could try finding smaller text-box boundaries, like https://www.geeksforgeeks.org/text-detection-and-extraction-using-opencv-and-ocr/ demonstrates, and see if I can find the coordinates of a text box that includes the name, so I can pull just that one line. – qkslvrwolf Feb 28 '21 at 22:30
  • You can use the same solution as here: https://stackoverflow.com/questions/52083129/digit-recognizing-using-opencv/52645271#52645271 – Albert Myšák Feb 28 '21 at 23:00

1 Answer

  • I tried grey-scaling it to no avail.

Converting the image to grayscale mainly makes the computation faster, since you go from three channels to one. If what you mean is "I applied preprocessing, but it didn't help", then you should look at the following techniques. Grayscale conversion on its own is not really preprocessing, just a computational convenience.

  • I thought given the sort of discrete columns here I'd be able to just split the string on spaces and newlines, but...no.

Did you try different page segmentation modes (--psm)? The default mode is not always a good fit for the layout of the input text, so it is worth trying the other modes.
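
As a rough sketch, one way to compare a few segmentation modes on the same image is to loop over them and compare the output by eye. NwEsC.png is the same file name used in the full code further down; the chosen modes are just common candidates for block/column layouts, not a definitive list:

# Rough sketch: compare a few Tesseract page segmentation modes on the
# same image and eyeball which one fits the layout best.
import cv2
import pytesseract

img = cv2.imread("NwEsC.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for psm in (4, 6, 11, 12):
    txt = pytesseract.image_to_string(gry, config=f"--psm {psm}")
    print(f"--- psm {psm} ---")
    print(txt)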

  • Any pointers in the right direction would be appreciated.

The first observation about the input image is that you don't need the bottom half. If your current image size is H and W, then you only need H/2 and W (the top half).

The second observation is that we need to binarize the image (here with Otsu thresholding). The result will be:

[the binarized top half of the screenshot]

If you read the result image, assuming a single uniform block of text (--psm 6), you get:

1 0 3 0 0 7 Whey. =ARAM= qksvlrwolf
3 0 3 0 0 7 2389 Darth_Zipzap
4° 0 6 0 0 3 1866 KILLAIRE
4 1 1 0 8 1 ARs johnny478423
3 0 1 0 0 6 1625 =CNAPC= _920831993
3 0 1 0 0 3 1368 ole] NCAT LG
2 0 0 0 13 0 1291 RN Bstecl ates)
4° 0 3 0 0 1 1070 fleadog406
1 0 0 0 0 3 eds imelaial etch}
2 01 0 1 2 CRS xL LongDubber
1 0 1 0 11 0 rh Dankdudledan
0 0 0 0 0 2 611 _Trix_1740
2 1 0 0 0 #0 Zs Illinois_Fats
10000 1 309 =MOPB= JP_Akatonbo
2 0 0 0 0 90 15 Plutoh71689
1 00 0 0 0 vy MakinHerSquirt

You will get a more accurate result compared to your previous try. However, not every word is recognized accurately. You can do the following:

    1. Read line-by-line: split the image into rows and recognize each row separately (a rough sketch of this is shown right after this list).
    2. Add a border to your image; centering the text may improve the accuracy.
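
For the first point, a minimal sketch: Tesseract's layout data can give you per-line bounding boxes, which you can then crop and re-read one at a time. The psm values and the level == 4 check (line boxes in the TSV output) are the only assumptions beyond the code below:

# Rough sketch: use Tesseract's layout data to crop and re-read each line.
# In image_to_data output, level 4 entries are line boxes.
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("NwEsC.png")
(h, w) = img.shape[:2]
img = img[0:int(h / 2), 0:w]
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

data = pytesseract.image_to_data(thr, config="--psm 6", output_type=Output.DICT)
for i in range(len(data["level"])):
    if int(data["level"][i]) == 4:  # 4 = line box
        x, y = data["left"][i], data["top"][i]
        bw, bh = data["width"][i], data["height"][i]
        line = thr[y:y + bh, x:x + bw]
        # --psm 7: treat the crop as a single text line
        print(pytesseract.image_to_string(line, config="--psm 7").strip())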

Code:

# Load the libraries
import cv2
import pytesseract

# Load the image in BGR format
img = cv2.imread("NwEsC.png")

# Get first-half of the image
(h, w) = img.shape[:2]
img = img[0:int(h/2), 0:w]

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Threshold
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# OCR
txt = pytesseract.image_to_string(thr, config="--psm 6")
print(txt)
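
Coming back to the original goal of pulling out a single line: a minimal sketch, building on the txt variable from the code above and assuming the player name you are after (fleadog406 here, taken from the output shown earlier) is recognized cleanly enough to match on:

# Rough sketch: find the one OCR'd row that contains a given player name.
# "fleadog406" is just an example name from the output above.
target = "fleadog406"
for line in txt.splitlines():
    if target in line:
        print(line.split())  # the row's tokens: the number columns plus the name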

Ahmet