I've successfully extracted text from a single PDF using a combination of the R magick package and tesseract, but have hit a roadblock when trying to process multiple images. (It is for a non-profit organization.)

I welcome answers in bash, but ask that they be comprehensive and not skip the tesseract component.

The answers to this question cover image cleaning without OCR, so I'm not sure how the first answer can be integrated here.

Image data: [image]

My process:

library(tesseract)
library(dplyr)
library(stringr)
library(pdftools)
library(readr)
library(magick)
library(purrr)
# original data
#pdf <- https://github.com/pembletonc/Project44_Text_Extraction/blob/master/test-data/001_0145.pdf

#image file (note that size here doesn't match processing below because of 2mb limit)

file_name <- tools::list_files_with_exts(dir = "./test-data", exts = "pdf")
page_count <- pdf_info(file_name)$pages  

multi_files <- list(pdftools::pdf_convert(file_name, page = 1:page_count,
                                          filenames = paste0("./test-data/", "page", 1:page_count, ".png"),dpi = 250))

#or just list the png files if they were already created
#multi_files <- list(tools::list_files_with_exts(dir = "./test-data", exts = "png"))

To read the images as magick files:

multi_images <- map(multi_files, image_read)

which creates a single magick pointer object (printed as a tibble) with the images joined as frames:

[[1]]
# A tibble: 5 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG     3243   2010 sRGB       FALSE        0 98x98  
2 PNG     3247   2013 sRGB       FALSE  4515441 98x98  
3 PNG     3243   2013 sRGB       FALSE  4559229 98x98  
4 PNG     3247   2010 sRGB       FALSE  4270145 98x98  
5 PNG     3247   2010 sRGB       FALSE  3212528 98x98  

How do I access each PNG in this object so I can clean it and run it through OCR?

multi_text_clean <- function(images){

  Map(function(x) {
    x %>% 
      image_crop(geometry_area(width = 2200, height = 1600, y_off = 500, x_off = 650)) %>%  
      image_resize("2000x") %>%
      image_background("white", flatten = TRUE) %>% 
      image_noise(noisetype = "Uniform") %>%          # Note: image_noise() adds noise; image_reducenoise() is the call that removes it
      image_enhance() %>%                             # Enhance image (minimize noise)
      image_normalize() %>% 
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_contrast(sharpen = 1) %>%
      #image_deskew(threshold = 40) %>% 
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)

}

This only runs it on the first image:

text_list <-  multi_text_clean(multi_images)
(text_multi <- stringr::str_split(text_list, pattern = "\\s{5,}"))

[[1]]
 [1] "Weather clear all day. A small arms inspection held at 1400 hrs. A recce party went\njout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nfor B Coy personnel by our YMCA Supervisor."                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
 [2] ")\nWeather clear and cold all day. Personnel packed equipment early in the morning and |~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,,\nPW brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division.\nNo other activity during the day. Patrols were sent out during the night by all coys}) u\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chateawv .\n\\Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
 [3] "“y\neather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
 [4] "f\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\nling but the CO cancelled it. Two Polish deserters from the German army walked into\n|A Coy lines."                                                                                                                                                                                                                                                                                                                          
 [5] "iz\nWeather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\nnew location at 0830 hrs. Unit started to move to new location at 1200 hrs, Unit   Bs\narrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
 [6] "| 9\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.| |\nQuiet all day. No enemy activity during the day."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
 [7] "|\neather overcast and snowing. Intelligence Section set up another OP at MR 268814.\nNo enemy activity during the day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
 [8] ":"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 [9] "‘\nWeather clear and cold, Bm started to move at 0830 hrs. Bn reached Champlon"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[10] "&\nFamenine, MR 3182 at 1230 hrs. Bn relieved the HLI. Coys immediately took up"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[11] ":\npositions for all around defence."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[12] "4\n"                                                                                                                                                                                                                             

How can I run this through each image in that magick object?

lobati
Corey Pembleton
  • I suggested using **GNU Parallel** for this sort of thing in a couple of other answers... https://stackoverflow.com/a/45032643/2836621 and here https://stackoverflow.com/a/44821317/2836621 – Mark Setchell Oct 04 '19 at 17:01
  • Thanks, I see now and in conversation below that this will be the approach I'll use (process w. parallel and command-line magick) and I'll import those images into tesseract in R after. A shame it all can't be done in R but I'm glad to be forced out of my comfort zone! – Corey Pembleton Oct 04 '19 at 17:08
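
The workflow settled on in the comments above (rasterize each PDF page, clean each page with command-line ImageMagick, then hand each cleaned page to tesseract) might look roughly like the sketch below. It assumes ImageMagick 7 (the `magick` command), `pdftoppm` from poppler-utils, and the `tesseract` CLI are on the PATH. The function names `rasterize_pdf`, `clean_page`, and `ocr_pages`, the directory layout, and the particular cleaning options are illustrative choices, not the asker's exact pipeline.

```shell
#!/bin/bash
# Sketch only: function names and the output layout are hypothetical.

# Rasterize every page of a PDF to PNG files named <outdir>/page-<n>.png.
rasterize_pdf() {
    local pdf=$1 outdir=$2
    mkdir -p "$outdir"
    pdftoppm -r 250 -png "$pdf" "$outdir/page"
}

# Clean one page image for OCR: grayscale, stretch contrast, add a white border.
clean_page() {
    local in=$1 out=$2
    magick "$in" -colorspace Gray -normalize -bordercolor white -border 10 "$out"
}

# Clean and OCR every page-*.png in a directory, writing one text file per page.
ocr_pages() {
    local dir=$1 page base
    mkdir -p "$dir/clean" "$dir/text"
    for page in "$dir"/page-*.png; do
        [ -e "$page" ] || continue          # the glob matched nothing
        base=$(basename "$page" .png)
        clean_page "$page" "$dir/clean/$base.png"
        # tesseract appends .txt to the output base name itself
        tesseract "$dir/clean/$base.png" "$dir/text/$base" -l eng -c preserve_interword_spaces=1
    done
}
```

A possible invocation would be `rasterize_pdf ./test-data/001_0145.pdf ./test-data/pages` followed by `ocr_pages ./test-data/pages`. Because every page is handled independently inside the loop, the loop body is also the natural unit to hand to GNU Parallel if the page count grows.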

2 Answers

Here's a script I put together (using many examples from Stack Overflow), which processes multiple .pdf files in a directory (or just one). Maybe it'll be of some help?

You can download the script at: https://drive.google.com/file/d/1fB9P0TQchE6vEr2MBug47aJIPc4yag45/view?usp=sharing

#!/bin/bash
echo -ne "\033]0;CREATE SEARCHABLE PDF MULTIPLE .PDF FILES\007" # set terminal title

# note to users

# install ImageMagick
# install tesseract
# install pdftk
# install libtiff

# get current directory

current_dir=$(pwd)

# set temporary directory name

temp_dir_nme=pdf_OCR_temp_nkAIumgy430qIRVn3Np6ZQx

# warn user that spaces in names are unsupported and that files will be renamed; \e[1;31m turns red colour on, \033[0m turns colour off

echo -e "\e[1;31mTAKE NOTE that directory names containing spaces are unsupported!\033[0m"
printf "\n"

echo -e "\e[1;31mTAKE NOTE FURTHER that your .pdf file/files will be renamed to replace any spaces in their names with underscores!\033[0m"
printf "\n"

# give user an opportunity to back out...

read -p "Press enter to continue..."
printf "\n"

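# rename .pdf files to replace spaces with underscores

for f in *.pdf
    do
        new_name="${f// /_}"
        # skip files whose names contain no spaces (mv fails when source and target are the same)
        [ "$f" = "$new_name" ] || mv -- "$f" "$new_name"
done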

# run script for each .pdf file in folder
for f in *.pdf
    do

    # establish path to input .pdf file
        path_to_file="$current_dir/"$f

    # make temp directory for operations
        mkdir $current_dir/$temp_dir_nme

    # copy .pdf file to temp directory
        cp $f $current_dir/$temp_dir_nme

    # change to temp directory to work the magick
        cd $current_dir/$temp_dir_nme

        no_pgs=$(pdftk $f dump_data | grep NumberOfPages | awk '{print $2}')
        pgs_per_vol=10 # for .pdf's of more than ten pages
        min_volumes=$(( no_pgs / pgs_per_vol )) # for .pdf's of more than ten pages
        fin_volume=$(( min_volumes+1 )) # for .pdf's of more than ten pages
        unlik_nme="nkAIumgy430qIRVn3Np6ZQx_" # give .pdf volumes unlikely names

        ss=$(( min_volumes * pgs_per_vol ))
        tt=$(( ss + 1 ))

    # chop .pdf into volumes

    # chop .pdf into one volume with a new name if it has ten or fewer pages
        if [ $no_pgs -lt 11 ]
            then
            pdftk $f cat 1-$no_pgs output $unlik_nme.pdf
        fi

    # chop .pdf into multiple volumes if it has eleven or more pages, excluding the final volume
        if [ $no_pgs -gt 10 ]
            then
                echo Chopping $f into $fin_volume volumes, to a maximum of ten pages per volume...
    
                i=1
                j=1
                k=$pgs_per_vol

                pdftk $f cat 1-$pgs_per_vol output "${unlik_nme}${i}".pdf # concatenate variables

                while [ $i -ne $min_volumes ]
                    do  
                        j=$(( $j + pgs_per_vol ))
                        k=$(( $k + pgs_per_vol ))
                        i=$(( $i + 1 ))
        
                        pdftk $f cat $j-$k output "${unlik_nme}${i}".pdf
                    done

        # create final volume of whatever number of pages           
            pdftk $f cat $tt-end output "${fin_volume}${unlik_nme}".pdf 2> /dev/null
        fi

    # remove initial .pdf file
        rm $f

    # rename pdf volumes in directory sorted by modification time, oldest first: ls -tr
        n=0; ls -tr | while read -r a; do n=$(( n+1 )); mv -- "$a" "$(printf '%03d' "$n")"_"$a"; done
    
        total_vols=$(( $min_volumes + 2 ))

    # loop over .pdf volumes in directory

        for files in *.pdf
            do
                echo Exporting $files to .png images...

            # export .pdf volume  to .png images
                pdftoppm -r 150 $files exported -png

            # delete .pdf volume
                rm $files
                echo Converting .png files to .jpg files...

            # convert .png files to .jpg files
                magick convert *.png %03d_converted.jpg

            # delete first .png images
                rm *.png
                
            # deskew images
                echo Deskewing text...
                magick convert *.jpg -deskew 90% %03d_deskewed.jpg

            # delete converted .jpg images
                rm *converted.jpg

            # enhance contrast
                echo Enhancing contrast...
                magick convert -brightness-contrast 0x10 *.jpg %03d_contrast.jpg

            # delete deskewed images
                rm *deskewed.jpg

            # crop and resize .jpg images to A4 ratio
                echo Resizing, and cropping...
                magick mogrify -format jpg -geometry "1680x2376^" -gravity center -extent 1680x2376 *.jpg

            # generate compressed .tiff image from .jpg images
                echo Converting resized and cropped .jpg"'s" to a compressed .tiff file...
                magick convert -compress lzw *.jpg images.tiff

            # delete .jpg images
                rm *.jpg
        
            # create .pdf
                magick convert images.tiff images.pdf
                printf "\n"
                echo Recognizing text...
                
            # recognize text
                tesseract images.tiff text -l eng -c textonly_pdf=1 pdf
                printf "\n"

            # delete .tiff to save space
                rm images.tiff

            # overlay the page images on the text-only .pdf, prefixing the volume name with "OC_"
                pdftk text.pdf multibackground images.pdf output "OC_"$files

            # compressing .pdf
                ps2pdf -dPDFSETTINGS=/ebook "OC_"$files "CR_"$files
        
        # end files loop
            done

# combine output .pdf files into "OCR"_original_file_name.pdf   
    pdftk CR_*.pdf cat output "OCR_"$f

# copy "OCR"_original_file_name.pdf to initial directory
    cp "OCR_"$f $current_dir

# change to previous directory
    cd $current_dir

# delete temporary directory, and temporary files
    rm -r $current_dir/$temp_dir_nme

    done

# print line

printf "\n"
printf '=%.s' {1..40}; echo
read -p "Press enter to exit..."
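
The volume-splitting arithmetic in the script above (at most ten pages per volume, with a shorter final volume) can be isolated into one small helper. The sketch below is an alternative formulation, not the script's own code: `page_ranges` is a hypothetical function name, and it prints one `first-last` range per line in the form that `pdftk ... cat` accepts. Computing the end of each range explicitly avoids requesting an empty final volume when the page count is an exact multiple of the volume size.

```shell
#!/bin/bash
# page_ranges TOTAL_PAGES [PAGES_PER_VOLUME]
# Hypothetical helper: prints one "first-last" page range per line.
page_ranges() {
    local total=$1 per_vol=${2:-10} first=1 last
    while [ "$first" -le "$total" ]; do
        last=$(( first + per_vol - 1 ))
        # the final volume may be shorter than per_vol
        [ "$last" -gt "$total" ] && last=$total
        echo "$first-$last"
        first=$(( last + 1 ))
    done
}
```

For example, `page_ranges 23` prints `1-10`, `11-20`, and `21-23`, each on its own line, while `page_ranges 20` prints only `1-10` and `11-20`. Each printed range could then be read in a `while read -r range` loop and passed to `pdftk "$f" cat "$range" output <volume>.pdf`.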
Markus

You can do the following in ImageMagick.

Input:

[input image]

convert img.jpg -negate -lat 20x20+10% -negate img_lat.jpg


[output image of the -lat command]

Or I have a bash shell script that uses ImageMagick, called textcleaner, that will do the following:

textcleaner -f 20 -o 10 img.jpg img_textcleaner.jpg


[output image of textcleaner]

fmw42
  • I'm not sure how this answers any part of my question of a) extracting multiple images from a single pdf b) processing with OCR/tesseract to extract text. Are there subsequent steps or approaches you recommend? – Corey Pembleton Oct 04 '19 at 16:10
  • Sorry, I misunderstood the question. I thought you just wanted to improve the readability of the text as preprocessing before tesseract. – fmw42 Oct 04 '19 at 16:12
  • Thank you, but I think I have that part down well (in R at least) using some magick functions (which work great!), it's the scaling part that I don't know! – Corey Pembleton Oct 04 '19 at 16:13
  • Do you need something other than a script loop? In ImageMagick command line, you can use mogrify to process a whole folder of images. – fmw42 Oct 04 '19 at 16:15
  • So do the first conversion in R from pdf-png, place the images from all pdf's into their respective folders, run mogrify on them to process, then run those images into tesseract in R? – Corey Pembleton Oct 04 '19 at 16:17
  • You can process all the pdf in a folder using the -lat method using mogrify if you use the same density to rasterize all of the pdf files. Are your pdfs single pages or multiple pages? – fmw42 Oct 04 '19 at 16:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/200405/discussion-between-corey-pembleton-and-fmw42). – Corey Pembleton Oct 04 '19 at 16:22
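
The folder-wide approach suggested in the comments above, applying the `-lat` cleanup to every rasterized page with `mogrify`, could be sketched as follows. The assumptions: ImageMagick 7 (`magick mogrify`), pages already rasterized to PNG at a common density, and a hypothetical function name `clean_folder` that writes to a separate output directory so the originals are not overwritten.

```shell
#!/bin/bash
# clean_folder SRC_DIR OUT_DIR
# Hypothetical helper: applies the local adaptive threshold from the answer
# above to every PNG in SRC_DIR and writes the results to OUT_DIR.
clean_folder() {
    local src=$1 out=$2
    # collect the PNG files; bail out if the glob matched nothing
    set -- "$src"/*.png
    [ -e "$1" ] || { echo "no PNG files in $src" >&2; return 1; }
    mkdir -p "$out"
    # -path sends the results to OUT_DIR instead of overwriting the inputs
    magick mogrify -path "$out" -negate -lat 20x20+10% -negate "$@"
}
```

A call such as `clean_folder ./test-data ./test-data/clean` would leave cleaned copies with the same file names in the output directory, ready to be passed to tesseract one file at a time.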