FScrawler: perform OCR selectively only on PDF files that do not have text

Question

I'm using FScrawler (2.7) to load text from PDFs into Elasticsearch (7.6.X). Most of the PDF files have text, but some contain only images of scanned text and need to be OCRed. Is there a way to configure FScrawler so that it performs OCR only on PDF files that contain images of scanned text, and not on files that already have text?

So far I can configure it either to skip OCR on all files (case 1) or to run OCR on all files (case 2). In the first case, FScrawler loads all files with text very quickly but skips all files that contain only images of scanned text. In the second case, it takes a really long time because it OCRs every file, including those that already have text.

The OCR settings for FScrawler are documented here: https://fscrawler.readthedocs.io/en/latest/user/ocr.html

Config for case 1:

name: "Case 1"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false
    pdf_strategy: 'no_ocr'

Config for case 2:

name: "Case 2"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: true
    pdf_strategy: 'ocr_and_text'

P.S. I could sort the PDFs into those that need OCR and those that do not by other means, and run a separate FScrawler job for each pile, but before I do that, I want to check whether there is an easier way using FScrawler's native features.
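
For reference, the fallback described in the P.S. could look roughly like the sketch below. It is not an FScrawler feature, just a pre-sorting step in Python. It assumes the third-party pypdf package is installed, and the names MIN_CHARS, needs_ocr and partition_pdfs, as well as the 20-character threshold, are made up for illustration. A PDF whose extracted text is shorter than the threshold is treated as a scan.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical threshold: fewer non-whitespace characters than this
# is treated as "no usable text layer". Tune it for your documents.
MIN_CHARS = 20


def extract_text_with_pypdf(pdf_path: Path) -> str:
    """Return the embedded text of a PDF (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so the rest works without it

    reader = PdfReader(str(pdf_path))
    return "".join(page.extract_text() or "" for page in reader.pages)


def needs_ocr(text: str, min_chars: int = MIN_CHARS) -> bool:
    """True when the extracted text is too short to be a real text layer."""
    non_whitespace = "".join(text.split())
    return len(non_whitespace) < min_chars


def partition_pdfs(
    pdf_paths: Iterable[Path],
    extract_text: Callable[[Path], str] = extract_text_with_pypdf,
) -> Dict[str, List[Path]]:
    """Split PDFs into a 'text' pile and a 'scanned' pile."""
    piles: Dict[str, List[Path]] = {"text": [], "scanned": []}
    for path in pdf_paths:
        try:
            text = extract_text(path)
        except Exception:
            # Unreadable or damaged files go to the OCR pile by assumption.
            text = ""
        piles["scanned" if needs_ocr(text) else "text"].append(path)
    return piles
```

Calling partition_pdfs(Path("/path/to/data/dir").glob("*.pdf")) would give the two piles. Each pile could then be moved to its own directory and crawled by a separate job, one configured like case 1 and one like case 2.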

Hi Paul, welcome to SO. For us to be able to help you with your question, you first need to help us understand your question by providing more detail. You can do that by including some code you've tried and the contents of some of the files, and by elaborating on your question. Please [edit](https://stackoverflow.com/review/first-posts/26333447) your question to help us help you. — Nico Nekoru, Jun 05 '20 at 22:35
Hi Neko, thank you for reaching out! I've added a link to the docs and an example of the config. There is no coding involved, just configuration. — Paul, Jun 06 '20 at 01:15

