I want to mask an arbitrary number (e.g. a mobile number) in images and in PDF files using Python. I have different types of image and PDF files, but I know that any 10-digit number in them is the number to mask. I can find it with a regex, but I am stuck on the masking step. Please help me resolve this.

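For image files:

import re
from PIL import Image
import pytesseract

# OCR the image to plain text, then mask any 10-digit number in that text.
# Note: this only changes the extracted string, not the image itself.
text = pytesseract.image_to_string(Image.open(filepath))
text = re.sub(r'\d{10}', 'xxxxxxxxxx', text)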

For PDF files:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open('filepath', 'rb')  # PDFPage.get_pages expects a binary file object, not a path string
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)
for page in pages:
    print('Processing next page...')
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('%r: %s' % ((x, y), text))

1 Answer

You may want to try the following.

Tesseract can report bounding-box coordinates for the text it recognizes. You could use those coordinates to stretch a bounding box across the ten digits and then draw it back onto the image as a mask. These links show how to get the coordinates:

  1. How to get the co-ordinates of the text recogonized from Image using OCR in python
  2. how to get character position in pytesseract
  3. https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files
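
As a rough illustration of the idea above, here is a minimal sketch for the image case. It assumes pytesseract and Pillow are installed, and it simplifies things by masking at word level: it blacks out the whole OCR word that contains a 10-digit run rather than stretching a box across individual characters. The names `boxes_to_mask` and `mask_image` are made up for this example.

```python
import re

# A run of exactly ten digits, not embedded in a longer digit run.
TEN_DIGITS = re.compile(r'(?<!\d)\d{10}(?!\d)')

def boxes_to_mask(data, pad=2):
    """Return (left, top, right, bottom) boxes for words holding a 10-digit number.

    `data` is a dict of parallel lists shaped like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    boxes = []
    for i, word in enumerate(data['text']):
        if TEN_DIGITS.search(word):
            left, top = data['left'][i], data['top'][i]
            right = left + data['width'][i]
            bottom = top + data['height'][i]
            # Pad slightly so the rectangle fully covers the glyphs.
            boxes.append((left - pad, top - pad, right + pad, bottom + pad))
    return boxes

def mask_image(src_path, dst_path):
    """OCR the image, black out every matching word, and save a copy."""
    # Imported here so boxes_to_mask can be used without OCR installed.
    from PIL import Image, ImageDraw
    import pytesseract
    from pytesseract import Output

    image = Image.open(src_path).convert('RGB')
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    draw = ImageDraw.Draw(image)
    for box in boxes_to_mask(data):
        draw.rectangle(box, fill='black')
    image.save(dst_path)
```

Calling `mask_image('input.png', 'masked.png')` (placeholder file names) would write a copy with the matching words covered. A number that the OCR splits into two words, such as "98765 43210", would be missed by this sketch, and it does not cover the PDF case.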
CypherX