I've extracted text from a PDF using pdftools and saved the result as a .txt file.
Is there an efficient way to convert the two-column text into a single column?
This is an example of what I have:
Alice was beginning to get very         into the book her sister was reading,
tired of sitting by her sister          but it had no pictures or conversations
on the bank, and of having nothing      in it, `and what is the use of a book,'
to do: once or twice she had peeped     thought Alice `without pictures or conversation?'
instead of
Alice was beginning to get very tired of sitting by her sister on the bank, and
of having nothing to do: once or twice she had peeped into the book her sister was
reading, but it had no pictures or conversations in it, `and what is the use of a
book,' thought Alice `without pictures or conversation?'
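To make the target concrete, here is a minimal sketch of the transformation on a sample laid out like the one above. It assumes the two columns are separated by a run of at least two spaces and that neither column contains such a run itself. The names `two_col_to_one` and `gap` are made up for illustration and are not part of pdftools or readr.

```r
# Hypothetical helper (not part of pdftools or readr): turn the lines of a
# two-column page into one column, left column first, then right column.
# Assumption: the columns are separated by a run of at least two spaces,
# and neither column contains such a run itself.
two_col_to_one <- function(lines, gap = " {2,}") {
  # Split each line at the first gap only: c(before, after), or just the
  # line itself when it contains no gap.
  parts <- regmatches(lines, regexpr(gap, lines), invert = TRUE)
  left  <- vapply(parts, function(p) trimws(p[1]), character(1))
  right <- vapply(parts, function(p) if (length(p) > 1) trimws(p[2]) else "",
                  character(1))
  out <- c(left, right)
  out[nzchar(out)]  # drop empty cells, e.g. a line that has only one column
}

page <- c(
  "Alice was beginning to get very         into the book her sister was reading,",
  "tired of sitting by her sister          but it had no pictures or conversations",
  "on the bank, and of having nothing      in it, `and what is the use of a book,'",
  "to do: once or twice she had peeped     thought Alice `without pictures or conversation?'"
)

res <- two_col_to_one(page)
cat(res, sep = "\n")
```

This only works while every line keeps a visible gap between the columns; a line whose left column runs right up to the right one would be kept whole.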
Based on "Extract Text from Two-Column PDF with R", I modified the function a bit to obtain:
library(readr)

# Collapse runs of whitespace to a single character and strip leading/trailing whitespace.
trim = function (x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x, perl=TRUE)

QTD_COLUMNS = 2

read_text = function(text) {
  result = ''
  # Get the index of every " " on the page.
  lstops = gregexpr(pattern =" ",text)
  # Put the indices of the most frequent " " positions in a vector.
  stops = as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for(i in seq(1, QTD_COLUMNS, by=1))
  {
    temp_result = sapply(text, function(x){
      start = 1
      stop = stops[i]
      if(i > 1)
        start = stops[i-1] + 1
      if(i == QTD_COLUMNS) # last column, read until end.
        stop = nchar(x)+1
      substr(x, start=start, stop=stop)
    }, USE.NAMES=FALSE)
    temp_result = trim(temp_result)
    result = append(result, temp_result)
  }
  result
}

txt = read_lines("alice_in_wonderland.txt")
result = ''
for (i in 1:length(txt)) {
  page = txt[i]
  t1 = unlist(strsplit(page, "\n"))
  maxSize = max(nchar(t1))
  # Pad every line with spaces to the same width.
  t1 = paste0(t1,strrep(" ", maxSize-nchar(t1)))
  result = append(result,read_text(t1))
}
result
But I've had no luck with some of the files. Is there a more general or better regular expression to achieve this?
Many thanks in advance!