I'm trying to convert many PDF documents to text in R so that I can use string parsing and regex to extract a set of codes from them. I am using ocr() from the tesseract package, and although it works on many of the pages, it misses a lot of the information I need.

I traced the problem to inconsistent line breaks in the image/PDF (see the linked example image).

I am trying to get the codes from the left column. The only codes that I'm able to extract successfully are the ones where the description is longer than a single line.

I've experimented with various pre-processing techniques using magick but have come up short in most cases. The only instance where I was able to get the code set was cropping the right-hand side out of the image, but unfortunately this is not an efficient solution in my case.

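library(magrittr)  # provides the %>% pipe used below

# Read the page image
file <- magick::image_read("44F245A2-5FEE-408F-A197-756436A5CAFD.png")

# Upscale, convert to grayscale, run OCR, and print the recognized text
file %>%
  magick::image_resize("2000x") %>%
  magick::image_convert(type = 'Grayscale') %>%
  tesseract::ocr() %>%
  cat()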

# or
# descriptions in this document.
# 94942C This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# | terpenes Steet gine see
# 272144 This is a description that takes on multiple lines. It can contain any combination of
# eee
# length of the description could be anywhere from 1 line to 5 lines of text.
# E76744 This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# [terpenes Steet gine see
# K77744 This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# | terrane een Steet gine seem
# 172744 This is a description that takes on multiple lines. It can contain any combination of
# Se
# length of the description could be anywhere from 1 line to 5 lines of text.
# A71744 This is a description that takes on multiple lines. It can contain any combination of
# alphanumeric characters or punctuation. Different types of things can go in here and the
# | teammates Steet gine see

Ideally I would like to be able to get all of the codes from the image in the above link. Any help would be awesome.
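
A minimal sketch of the regex extraction step, assuming each code is six uppercase alphanumeric characters at the start of a line followed by a space. That pattern is an assumption inferred from the sample output above (94942C, 272144, E76744), not something stated in the post, and `extract_codes` is a hypothetical helper name.

```r
# Hypothetical helper: pull leading codes out of OCR text.
# Assumption: a code is exactly six characters from [A-Z0-9],
# sits at the start of a line, and is followed by a space.
extract_codes <- function(ocr_text) {
  # Split the OCR output into individual lines
  lines <- unlist(strsplit(ocr_text, "\n", fixed = TRUE))
  # regexpr() finds the first match per line; regmatches() keeps only
  # the lines that matched and returns the matched substrings
  regmatches(lines, regexpr("^[A-Z0-9]{6}(?= )", lines, perl = TRUE))
}

sample_text <- paste(
  "94942C This is a description that takes on multiple lines.",
  "alphanumeric characters or punctuation.",
  "272144 This is a description that takes on multiple lines.",
  sep = "\n"
)
extract_codes(sample_text)
```

A description line that happens to begin with a six-character uppercase word would also match, so the pattern may need tightening against the real code format.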

jon
  • The problem is that the text is underlined. Tesseract has difficulty detecting lines of text when they are underlined. Perhaps you could try using Leptonica or something similar to detect and delete the underlining. – Grada Gukovic Jul 18 '19 at 16:01

1 Answer

Try a different page segmentation mode (PSM). The modes Tesseract offers are:

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

Try PSM 4 for your case. In my experience, PSM 12 extracts the most text, but the text may not come out in reading order, which could be a problem if you need to match the codes to their descriptions.
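
In the R tesseract package, the mode can be set through the `tessedit_pageseg_mode` engine option. The following is a rough sketch rather than a tested fix for this image: `ocr_with_psm` is a hypothetical helper name, the preprocessing steps are copied from the question, and the `"eng"` language is an assumption.

```r
# Hypothetical helper: run OCR on an image with a chosen page segmentation mode.
# Assumes English text and reuses the resize/grayscale steps from the question.
ocr_with_psm <- function(path, psm = 4) {
  # Build an engine whose page segmentation mode is set via a Tesseract option
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_pageseg_mode = psm)
  )
  img <- magick::image_read(path)
  img <- magick::image_resize(img, "2000x")
  img <- magick::image_convert(img, type = "Grayscale")
  # ocr() accepts a magick image and returns the recognized text as a string
  tesseract::ocr(img, engine = engine)
}

# Example call, using the file name from the question:
# cat(ocr_with_psm("44F245A2-5FEE-408F-A197-756436A5CAFD.png", psm = 4))
```

Running the same image through several modes (for example 4, 6, and 12) and comparing the output is a quick way to see which one keeps the left-column codes.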

victormeriqui