0

Hi, I'm trying to read a PDF in Ruby. As a first step I want to convert it to a .txt file. Here path is the path to the PDF. The problem is that the .txt file I get is empty. Someone told me it is a pdftotext problem, but I don't know how to fix it.

  spec = path.sub(/\.pdf$/, '')
  `pdftotext #{spec}.pdf`
  file = File.new("#{spec}.txt", "w+")
  text = []
  file.readlines.each do |l|
    if l.length > 0
      text << l
      Rails.logger.info l
    end
  end
  file.close

What's wrong with my code? Thanks!

Anna

3 Answers

2

It's not possible to extract text from every PDF. Some PDF files use a font encoding that makes it impossible to extract text with simple tools such as pdftotext (and some PDF files are even completely immune to direct text extraction with any tool known to me -- in these cases you'll have to apply OCR first to have a chance to extract text...).

So if you test your code with the same "weird" PDF file all the time, it may well happen that you're getting frustrated over your code while in reality the fault lies with the PDF.

First make sure that running pdftotext from the command line works well with a given PDF, then test (and develop further) your code with that PDF.
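The same check can be done from Ruby before any further processing. The following is only a minimal sketch: extract_pdf_text is a hypothetical helper name, and it assumes pdftotext is installed and on the PATH. It relies on the "-" output argument to get the text on stdout and treats empty output as a hint that the PDF itself may be the problem.

```ruby
require 'open3'

# Hypothetical helper: runs pdftotext on pdf_path and returns the extracted text.
# Raises if the tool is missing, exits with an error, or yields no text at all.
def extract_pdf_text(pdf_path, command: "pdftotext")
  # Passing the arguments separately bypasses the shell, so odd paths need no quoting.
  stdout, stderr, status = Open3.capture3(command, pdf_path, "-")
  raise "#{command} failed: #{stderr.strip}" unless status.success?
  if stdout.strip.empty?
    raise "#{command} produced no text for #{pdf_path}; the PDF may need OCR"
  end
  stdout
rescue Errno::ENOENT
  # Open3 raises Errno::ENOENT when the executable cannot be found.
  raise "#{command} not found; is it installed and on the PATH?"
end
```

If this raises for one PDF but works for another, the fault most likely lies with the PDF rather than with the surrounding Ruby code.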

Kurt Pfeifle
1

The problem is that you are opening the file in a write mode ("w+"), which truncates the file to zero length before you read it. You can see a table of file modes and what they mean at http://ruby-doc.org/core-1.9.3/IO.html.
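The truncation is easy to reproduce without any PDF involved. This is a small sketch, with lines_seen_with_mode as a made-up helper name and a temporary file standing in for the output of pdftotext:

```ruby
require 'tmpdir'

# Writes a two-line file, reopens it with the given mode,
# and returns what readlines sees.
def lines_seen_with_mode(mode)
  Dir.mktmpdir do |dir|
    txt = File.join(dir, "sample.txt")
    File.write(txt, "line one\nline two\n") # stands in for pdftotext's output
    File.open(txt, mode) { |file| file.readlines }
  end
end

p lines_seen_with_mode("r")  # => ["line one\n", "line two\n"]
p lines_seen_with_mode("w+") # => []
```

With "r" the existing contents are read back; with "w+" the file is emptied on open, so readlines has nothing left to return.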

Try something like this. It uses pdftotext's "-" output argument to send the text to stdout, which avoids creating a temporary file, and uses blocks for more idiomatic Ruby.

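require 'shellwords'

# "-" as the output file makes pdftotext write the text to stdout;
# escaping the path keeps spaces and shell metacharacters from breaking the command
text = `pdftotext #{Shellwords.escape(path)} -`
# each_line splits on newlines (String#split with no argument splits on any whitespace)
text.each_line.map(&:chomp).reject(&:empty?).each do |line|
  Rails.logger.info(line)
end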
James Healy
  • Thanks, but the problem is that I get [] when it executes Rails.logger.info(line) – Anna Nov 28 '12 at 12:33
0

You would need to open the txt file with write permission.

file = File.new("#{spec}.txt", "w")

You could consult "How to create a file in Ruby".


Update: your code is not complete and looks buggy.

  1. I can't tell what path is.
  2. It looks like you are trying to read the text file that you intend to write to (file.readlines.each).
  3. Check the spelling of length; you have l.lenght.

You may want to paste the actual code.


Check this gist https://gist.github.com/4160587

As mentioned, your code is not working because you are reading and writing to the same file.

Example

Ruby code file_write.rb, which does the file write operation

pdf_file = File.open("in.txt")
output_file = File.open("out.txt", "w") # file to which you want to write
# iterate over the input file and write its content to the output file
pdf_file.readlines.each do |l|
  output_file.puts(l)
end
output_file.close
pdf_file.close

Sample txt file in.txt

Some text in file
Another line of text

1. Line 1
2. Not really line 2

Once you run file_write.rb you should see a new file called out.txt with the same content as in.txt. You could change the content of the input file if you want. In your case you would use a PDF reader to get the content and write it to the text file, so basically only the first line of the code will change.
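As a rough sketch of that last step, the writing part can be kept separate from wherever the text comes from. Here write_nonblank_lines is a hypothetical helper, and the sample string merely stands in for text extracted from the PDF:

```ruby
require 'tmpdir'

# Hypothetical helper: writes every non-blank line of `text` to out_path
# and returns the lines that were written.
def write_nonblank_lines(text, out_path)
  lines = text.each_line.map(&:chomp).reject { |line| line.strip.empty? }
  # The block form of File.open closes the file automatically.
  File.open(out_path, "w") do |output_file|
    lines.each { |line| output_file.puts(line) }
  end
  lines
end

Dir.mktmpdir do |dir|
  out_path = File.join(dir, "out.txt")
  sample = "Some text in file\n\nAnother line of text\n" # stands in for extracted PDF text
  write_nonblank_lines(sample, out_path)
  print File.read(out_path)
end
```

The input is only read and the output is only written, so the file being filled is never the one being read.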

ch4nd4n
  • `path` is the path to the PDF. Yes, I'm trying to read it. Is the way I did it not correct? – Anna Nov 28 '12 at 09:30
  • Are you still getting same error message? Update your question with the complete error stack. In the updated code you are still reading the file which you intend to write. – ch4nd4n Nov 28 '12 at 10:14
  • No, I don't get any error now, the problem is that I get an empty .txt file, the code is updated again. – Anna Nov 28 '12 at 10:31
  • I have updated the answer once again. You should not be reading the file that you intend to write to. `file.readlines.each do |l|` is wrong it should be `spec.readlines.each do |l|` – ch4nd4n Nov 28 '12 at 11:33