Text extraction from multiple image files using Python

Question

I have a folder containing multiple image files. I want to extract the text from each file and save the output as a CSV file with two columns: Image_no. (first column) and Text (second column).

TIA

I have tried this code in Python:

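import glob
import os

import cv2

img_dir = "MyFolder"  # Folder containing the image files
data_path = os.path.join(img_dir, '*g')  # Matches names ending in "g" (.png, .jpg, .jpeg)
files = glob.glob(data_path)
data = []
for f1 in files:
    img = cv2.imread(f1)  # Loads the pixel array only; no text is extracted here
    data.append(img)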

Q1: How can I see the text that is extracted from each image?
Q2: How can I export the image name and the corresponding text to CSV?

your current code is simply appending all the pixels from each image into a single list. look into tesseract or other OCR libraries you can easily integrate with OpenCV, for example: https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/ — George Profenza, Jul 30 '19 at 10:53

score 3 · Answer 1 · answered Nov 07 '19 at 15:02

Part 1:

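Please install the Tesseract OCR engine and the pytesseract wrapper. The Python packages come from pip (pandas is needed for Part 2):

pip install pillow
pip install pytesseract
pip install pandas

Tesseract itself is a separate program, not a pip package, so install it with your operating system's installer or package manager. Part 2 assumes the 64-bit Windows installer.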
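
Before running OCR it can help to confirm that the Tesseract executable is actually reachable. The following is only a minimal sketch: find_tesseract is a hypothetical helper name, and the candidate list is an assumption you would adapt to your own install location.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the first candidate that resolves to an executable, or None.

    Each candidate may be a bare command name (looked up on PATH) or a
    full path such as the one assigned to tesseract_cmd in Part 2.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_tesseract()
    print(found if found else "Tesseract executable not found")
```

If this reports that nothing was found, pytesseract will most likely fail as well until the engine is installed or tesseract_cmd points at it.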

Part 2:

from PIL import Image
import pytesseract
import os
import pandas as pd

# Default install path used by the 64-bit Windows installer
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract-OCR/tesseract.exe"

f = []
t = []
input_dir = r'C:/Users/suhas/Downloads/images/'

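for root, dirs, filenames in os.walk(input_dir):
    for filename in filenames:
        try:
            print(filename)
            # Join with root so files in subfolders resolve correctly
            img = Image.open(os.path.join(root, filename))
            text = pytesseract.image_to_string(img, lang='eng')
            # Append both values only after OCR succeeds so the lists stay aligned
            f.append(filename)
            t.append(text)
            print(text)
            print('-=' * 20)
        except Exception as exc:
            # Skip files that cannot be opened or read as images
            print('Skipped', filename, exc)
            continue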


df = pd.DataFrame(list(zip(f, t)),columns=['file_Name','Text'])

Output:

                     file_Name      Text
0   Screenshot_20191104-130254.png  MNP_6050
1   Screenshot_20191104-130336.png  MNP_6039
2   Screenshot_20191104-130943.png  MNP_6116
3   Screenshot_20191104-131248.png  MNP_6093
4   Screenshot_20191104-230714.png  MNP_6013
5   Screenshot_20191104-230834.png  MNP_6006
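
To cover Q2 (exporting to CSV), one option is to write the name/text pairs with Python's built-in csv module. This is a rough sketch rather than part of the original answer: write_ocr_csv and the file name ocr_output.csv are made-up names, and the sample rows are copied from the output above. With the DataFrame from Part 2, df.to_csv('ocr_output.csv', index=False) should give an equivalent file.

```python
import csv


def write_ocr_csv(rows, csv_path):
    """Write (image_name, text) pairs to a two-column CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Image_no.", "Text"])  # header requested in the question
        writer.writerows(rows)


if __name__ == "__main__":
    # Two rows taken from the sample output; real code would pass zip(f, t)
    sample = [
        ("Screenshot_20191104-130254.png", "MNP_6050"),
        ("Screenshot_20191104-130336.png", "MNP_6039"),
    ]
    write_ocr_csv(sample, "ocr_output.csv")
```

The csv writer quotes fields that contain commas or line breaks, which matters here because OCR text is often multi-line.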

PS: To get clean text, you may need to post-process the OCR output with a regular expression.
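
As an illustration of the regex clean-up, here is a small sketch. Both function names are hypothetical, and the pattern [A-Z]+_\d+ is only an assumption based on values like MNP_6050 in the sample output; your images may need a different pattern.

```python
import re


def clean_text(raw):
    """Collapse whitespace runs (spaces, newlines, form feeds) and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def find_code(raw, pattern=r"[A-Z]+_\d+"):
    """Return the first substring matching pattern, or None if nothing matches."""
    match = re.search(pattern, raw)
    return match.group(0) if match else None


if __name__ == "__main__":
    raw = "  MNP_6050\n\n\x0c"  # OCR output often carries trailing whitespace
    print(repr(clean_text(raw)))
    print(find_code(raw))
```

You could apply clean_text to each value before appending it to the list t, so the Text column holds one tidy line per image.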
