Clean data in R from an image

Question

I am trying to extract text from an image with OCR and clean it. The result is a character vector in which each sentence is split across several lines, but I would like the text laid out the way it appears in the image.

The code:

heraclitus <- "greek.png"
library(tidyverse)
library(tesseract)
library(magick)

image_greek <- image_read(heraclitus)

# Preprocess: scale, crop to the text, convert to grayscale, boost contrast
image_greek <- image_greek %>% image_scale("600") %>%
  image_crop("600x400+220+150") %>%
  image_convert(type = 'Grayscale') %>%
  image_contrast(sharpen = 1) %>%
  image_write(format = "jpg")

# Run OCR and split the recognized text into lines
heraclitus_sentences <- magick::image_read(image_greek) %>%
  ocr() %>% str_split("\n")

As you can see from the output, there are empty elements, and some sentences are split across two lines. I would like a vector or a list in which each element is one complete sentence.

Please include the output of `dput(heraclitus_sentences)` in the question to make it reproducible. — Peter, Jun 02 '23 at 11:35

score 1 · Answer 1 · answered Jun 02 '23 at 12:00

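One approach: mark the empty elements with a separator, paste everything into one string, then split on the separator: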

heraclitus_sentences <- list(c('this is', 'the first sentence',
                               '', 'and', 'this', 'the second'))

separator <- '___'

gsub('^$', separator, heraclitus_sentences[[1]]) |>
  paste(collapse = ' ') |>
  strsplit(separator)

[[1]]
[1] "this is the first sentence " " and this the second"
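The result above keeps a stray space at each split point. A minimal sketch that trims it, assuming the same sample data and R 4.1 or later for the native pipe; `trimws()` and `fixed = TRUE` are additions here, not part of the original answer:

```r
# Same sample data as above
heraclitus_sentences <- list(c('this is', 'the first sentence',
                               '', 'and', 'this', 'the second'))
separator <- '___'

sentences <- gsub('^$', separator, heraclitus_sentences[[1]]) |>
  paste(collapse = ' ') |>
  strsplit(separator, fixed = TRUE) |>  # treat the separator literally
  unlist() |>                           # list -> character vector
  trimws()                              # drop leading/trailing spaces

sentences
```

The separator must be a string that cannot occur in the OCR text itself; otherwise real sentences would be split as well.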

Please remember to present code/sample data in a format others can readily work with (no screenshots).

score 1 · Accepted Answer · answered Jun 02 '23 at 12:05

You need to split on \n\n (not \n), which separates the sentences, then replace the remaining \n inside each sentence with a space:

magick::image_read(image_greek) %>% 
  ocr() %>% 
  str_split("\n\n") %>%
  unlist() %>%
  str_replace_all("\n", " ")

Output:

[1] "© Much learning does not teach understanding."                                                       
[2] "© The road up and the road down is one and the same."                                                
[3] "© Our envy always lasts longer than the happiness of those we envy."                                 
[4] "© No man ever steps in the same river twice, for it's not the same river and he's not the same man. "
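The same idea can be tried without the image. This is a minimal base R sketch, not the answer's own code: `ocr_text` is a hypothetical stand-in for the string returned by `ocr()`, and treating the leading "©" as a misread bullet is an assumption based on the output above.

```r
# Hypothetical stand-in for the ocr() result; the real text depends on the image.
# "\u00a9" is the copyright sign that appears at the start of each line of output.
ocr_text <- paste0(
  "\u00a9 Much learning does not teach\nunderstanding.\n\n",
  "\u00a9 The road up and the road down\nis one and the same.\n"
)

# Split on blank lines: one element per sentence
sentences <- strsplit(ocr_text, "\n\n", fixed = TRUE)[[1]]
# Join lines that were wrapped inside a sentence
sentences <- gsub("\n", " ", sentences, fixed = TRUE)
# Drop the leading marker (assumed to be a misread bullet)
sentences <- sub("^\u00a9 ", "", sentences)
# Remove the trailing space left by the final newline
sentences <- trimws(sentences)

sentences
```

If the real OCR output separates sentences with something other than a blank line, the split pattern would need to change accordingly.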
