I'm using FSCrawler (2.7) to load text from PDFs into Elasticsearch (7.6.x). Most of the PDF files already contain text, but some contain only images of scanned text and need to be OCRed. Is there a way to configure FSCrawler so that it performs OCR only on the PDF files that contain images of scanned text, and not on files that already have text?
So far I can configure it either to do no OCR on any file (case 1) or to do OCR on every file (case 2). In the first case, FSCrawler loads all files with text very quickly but skips all files that contain images of scanned text. In the second case, indexing takes a really long time because it OCRs all the files, including those that already have text.
The OCR settings for FSCrawler are documented here: https://fscrawler.readthedocs.io/en/latest/user/ocr.html
Config for case 1:
name: "Case 1"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false
    pdf_strategy: 'no_ocr'
Config for case 2:
name: "Case 2"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: true
    pdf_strategy: 'ocr_and_text'
P.S. I could sort the PDFs into scanned and text-based piles by other means and run a separate FSCrawler job for each pile, but before I do this, I want to check whether there is an easier way using FSCrawler's native features.
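For reference, the manual sorting fallback I have in mind would look roughly like the sketch below. It is not an FSCrawler feature. It assumes the third-party pypdf package is installed, and the names `extract_text_pypdf`, `needs_ocr`, and `sort_pdfs`, as well as the 10-character threshold, are my own hypothetical choices.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def extract_text_pypdf(pdf_path: Path) -> str:
    """Return the embedded text layer of a PDF (assumes `pip install pypdf`)."""
    # Imported lazily so the rest of the sketch runs without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield an empty result for image-only pages.
    return "".join(page.extract_text() or "" for page in reader.pages)


def needs_ocr(text: str, min_chars: int = 10) -> bool:
    """Treat a PDF as scanned when its text layer is (almost) empty.

    min_chars is an arbitrary, hypothetical threshold; tune it for your data.
    """
    non_whitespace = "".join(text.split())
    return len(non_whitespace) < min_chars


def sort_pdfs(
    paths: Iterable[Path],
    extract_text: Callable[[Path], str] = extract_text_pypdf,
    min_chars: int = 10,
) -> Dict[str, List[Path]]:
    """Split PDFs into a 'text' pile and a 'scanned' pile."""
    piles: Dict[str, List[Path]] = {"text": [], "scanned": []}
    for path in paths:
        pile = "scanned" if needs_ocr(extract_text(path), min_chars) else "text"
        piles[pile].append(path)
    return piles
```

Calling `sort_pdfs(Path("/path/to/data/dir").rglob("*.pdf"))` would give two lists of paths. The "text" pile could then go to a directory indexed by the case 1 job and the "scanned" pile to a directory indexed by the case 2 job.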