
I've extracted text from a PDF using pdftools and saved the result as a txt file.

Is there an efficient way to convert a two-column txt file into a single-column file?

This is an example of what I have:

Alice was beginning to get very      into the book her sister was reading,
tired of sitting by her sister       but it had no pictures or conversations
on the bank, and of having nothing   in it, `and what is the use of a book,' 
to do: once or twice she had peeped  thought Alice `without pictures or conversation?`

instead of

    Alice was beginning to get very tired of sitting by her sister on the bank, and 
of having nothing to do: once or twice she had peeped into the book her sister was 
reading, but it had no pictures or conversations in it, `and what is the use of a 
book,' thought Alice `without pictures or conversation?'

Based on "Extract Text from Two-Column PDF with R", I modified the function a bit to obtain:

library(readr)    
trim = function (x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x,  perl=TRUE)

QTD_COLUMNS = 2

read_text = function(text) {
  result = ''
  # Get the positions of every " " on the page.
  lstops = gregexpr(pattern =" ",text)
  # Put the positions of the two most frequent spaces in a vector.
  stops = as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved)
  for(i in seq(1, QTD_COLUMNS, by=1))
  {
    temp_result = sapply(text, function(x){
      start = 1
      stop =stops[i] 
      if(i > 1)            
        start = stops[i-1] + 1
      if(i == QTD_COLUMNS)#last column, read until end.
        stop = nchar(x)+1
      substr(x, start=start, stop=stop)
    }, USE.NAMES=FALSE)
    temp_result = trim(temp_result)
    result = append(result, temp_result)
  }
  result
}

txt = read_lines("alice_in_wonderland.txt")

result = ''

for (i in 1:length(txt)) { 
  page = txt[i]
  t1 = unlist(strsplit(page, "\n"))      
  maxSize = max(nchar(t1))
  t1 = paste0(t1,strrep(" ", maxSize-nchar(t1)))
  result = append(result,read_text(t1))
}

result

But this doesn't work for some of the files. I wonder if there's a more general or better regular expression to achieve the result.

Many thanks in advance!

pachadotdev
  • I'd be tempted to locate a non-PDF alternative. And if you want to use that specific story, there's a plain text version here: http://www.gutenberg.org/files/11/11-0.txt. Failing that, look for another PDF to text conversion tool which will convert to 1-column output. – neilfws Jun 01 '17 at 03:40
  • Looks like a fixed width file - `dat <- read.fwf(file, widths=c(37,48), stringsAsFactors=FALSE)` would give you a very good start if there is always a constant width in the two columns. – thelatemail Jun 01 '17 at 04:03
  • What [saved my sanity](https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html) was realizing that `pdftohtml` has a very useful XML output mode. – Sinan Ünür Jun 06 '17 at 14:57

2 Answers


This looks like a fixed-width file, provided the two columns always have a constant width:

dat <- read.fwf(textConnection(txt), widths=c(37,48), stringsAsFactors=FALSE)
gsub("\\s+", " ", paste(unlist(dat), collapse=" "))

This puts everything into one long string:

[1] "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?"
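
If the column boundary isn't known in advance (the widths 37 and 48 above were read off the sample), one possible heuristic is to look for the widest run of character positions that are blank on every line of a page. Below is a minimal Python sketch of that idea, not a tested general solution. It assumes one page of text at a time, columns padded with spaces rather than tabs, and a single gutter. `find_gutter` and `two_columns_to_one` are illustrative names, not part of pdftools or any library.

```python
def find_gutter(lines):
    """Return the index where the right column starts, or None if no gutter is found.

    Heuristic (an assumption, not a guarantee): the gutter is the widest run of
    character positions that are blank on every non-empty line.
    """
    rows = [ln.rstrip() for ln in lines if ln.strip()]
    if not rows:
        return None
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    # blank[i] is True when position i holds a space on every row.
    blank = [all(r[i] == " " for r in padded) for i in range(width)]

    best_end, best_len = None, 0
    i = 0
    while i < width:
        if not blank[i]:
            i += 1
            continue
        j = i
        while j < width and blank[j]:
            j += 1
        # Skip runs touching the left edge (shared indentation).
        if i > 0 and j < width and j - i > best_len:
            best_end, best_len = j, j - i
        i = j
    return best_end


def two_columns_to_one(lines):
    """Join the left column of every line, then the right column, into one string."""
    cut = find_gutter(lines)
    if cut is None:
        # No gutter detected: treat the input as a single column.
        return " ".join(" ".join(lines).split())
    left = [ln[:cut] for ln in lines]
    right = [ln[cut:] for ln in lines]
    # split()/join collapses runs of whitespace, including newlines, to one space.
    return " ".join(" ".join(left + right).split())
```

On very short pages an ordinary word gap can line up across all lines and be mistaken for the gutter, so the result is worth checking by eye before processing a whole document.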
thelatemail

With the fixed-width left column, we can split each line into the first 37 characters and the rest, appending these to separate strings for the left and right columns. For instance, with a regex in Perl:

use warnings;
use strict;

my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

my ($left_col, $right_col);

while (<$fh>) 
{
    my ($left, $right) = /(.{37})(.*)/;

    $left =~ s/\s*$/ /;

    $left_col  .= $left;
    $right_col .= "$right ";   # separator, so consecutive right-column lines don't run together
}
close $fh;

print $left_col, $right_col, "\n";

This prints the whole text. Alternatively, join the columns into one string: my $text = $left_col . $right_col;

The regex pattern (.{37}) matches any character (.) exactly 37 times ({37}) and captures the match with (); the (.*) captures everything that remains on the line. In list context the match returns both captures, which are assigned to $left and $right. The trailing whitespace in $left is condensed into a single space. Both are then appended (.=) to their column strings.

Or from the command line

perl -wne'
    ($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $cL .= $l; $cR .= "$r "; 
     }{ print $cL,$cR,"\n"
' two_column.txt

where }{ closes the implicit while loop that -n wraps around the code and opens a new block, so what follows acts like an END block: it runs once, after all lines have been processed.
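
For comparison, the same fixed-width split can be sketched with plain string slicing in Python. One difference is worth noting: a pattern that requires exactly 37 characters does not match lines shorter than that (for example, the last line of a paragraph), while slicing simply yields an empty right-hand part. This is only a sketch. The width of 37 is taken from the sample above, and `merge_fixed_width` and `left_width` are hypothetical names.

```python
LEFT_WIDTH = 37  # width of the left column in the sample; adjust per file


def merge_fixed_width(lines, left_width=LEFT_WIDTH):
    """Read the left column top to bottom, then the right column, as one string."""
    left_parts, right_parts = [], []
    for line in lines:
        line = line.rstrip("\n")
        # Slicing past the end of a short line yields "", so nothing is lost.
        left_parts.append(line[:left_width].strip())
        right_parts.append(line[left_width:].strip())
    # Drop empty pieces so blank lines don't produce doubled spaces.
    return " ".join(part for part in left_parts + right_parts if part)
```

It could be applied to a file with something like `merge_fixed_width(open("two_column.txt"))`, since iterating over a file object yields its lines.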

zdim