
I have searched extensively without luck, so I am asking here. Does anyone know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?

I am not sure whether this is the best place to ask, either.

My application is on Rails 4.2. The PDF file contains a header and a big table with about 10 columns.

More info: the user uploads the PDF via a form, and I then need to take the PDF, convert it to CSV, and read the content. I tried reading the content with the pdf-reader gem, but the result wasn't really promising.

I have used freepdfconvert.com/pdf-excel. Unfortunately, they don't provide an API (I have contacted them).

Sample PDF: [image not included]

This piece of code converts the PDF into text, which is handy. Gem: pdf-reader

  require 'pdf-reader'

  # Print the extracted text of every page in the uploaded PDF.
  def self.parse
    reader = PDF::Reader.new("pdf_uploaded_by_user.pdf")
    reader.pages.each do |page|
      puts page.text
    end
  end

Now, if you check the attached sample PDF, you will see that some fields may be empty. This means I can't simply split each text line on spaces and put the pieces in an array, because I wouldn't be able to map the array elements to the correct fields.
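To illustrate the problem with empty fields, here is a minimal sketch of one possible workaround: slicing each text line by fixed character offsets instead of splitting on spaces. The column names and offsets in `COLUMNS` are hypothetical, and the sketch assumes the extracted text keeps the table columns aligned, which may not hold for every PDF.

```ruby
# Hypothetical column layout: field name => character range within a text line.
# These names and offsets are assumptions; measure them from your own extracted text.
COLUMNS = {
  date:   0...12,
  item:   12...30,
  amount: 30...40
}.freeze

# Slice one line of extracted text into a hash of stripped field values.
# A blank column, or one past the end of a short line, becomes "".
def split_fixed_width(line, columns = COLUMNS)
  columns.each_with_object({}) do |(name, range), row|
    row[name] = (line[range] || "").strip
  end
end

# Apply the slicing to every non-blank line of a page's text.
def parse_fixed_width(text, columns = COLUMNS)
  text.each_line
      .map(&:chomp)
      .reject { |line| line.strip.empty? }
      .map { |line| split_fixed_width(line, columns) }
end
```

Each hash keeps an empty field as an empty string in its own position, so the values still map to the right columns. The input would be something like `page.text` from the snippet above.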

Thank you.

Mr H
  • Really? Why did you give it a minus point? I asked a question. You could have simply said this does not belong here instead of giving it a minus point >:( – Mr H May 12 '15 at 06:55
  • Are you generating the PDF from your program, or is it an external PDF? – Santhucool May 12 '15 at 06:56

3 Answers


OK, after a lot of research I couldn't find an API or even proper software that does this directly. Here is how I did it.

I first extract the table from the PDF into an HTML table with the pdftables API. It is cheap.

Then I convert the HTML table to CSV.

(This is not ideal but it works)

Here is the code:

require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the HTML response to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })

    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response.body
    end
  end

  # Read the saved HTML and write each table row as one CSV row.
  def convert
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    # Keyword options (not a positional hash) so this also works on Ruby 3.
    CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: '\'', force_quotes: true) do |csv|
      doc.xpath('//table/tr').each do |row|
        csv << row.xpath('td').map(&:text)
      end
    end
  end
end

Now run it like this:

#> page = PageTextReceiver.new
#> page.run
#> page.convert

It is not refactored; it is just a proof of concept. You will need to consider performance.

I might use the Sidekiq gem to run it in the background and move the result to the main thread.

Mr H
  • Very nice solution!! How did the refactor come together? Were you able to improve on the solution? – josh p Apr 15 '16 at 23:36
  • No, sorry. I delivered the project and moved on. The `pdftables` API has improved since I used it, though. Good luck. The client has been using it regularly, with no bugs or crashes reported. I used the `sidekiq` gem just to make things perform better. Same approach, though. – Mr H Apr 17 '16 at 11:06

Check the Tabula-Extractor project, and also look at how it is used in projects like the NYPD Moving Summonses Parser and the CompStat criminal complaints parser.

Eugene

Ryan Bates covers CSV exports in his RailsCasts episode at http://railscasts.com/episodes/362-exporting-csv-and-excel, which might give you some pointers.

Edit: as you now mention that you need the raw data from an uploaded PDF, you could use JavaScript to read the PDF file and then populate the data into Ryan Bates' export method. Reading PDFs was covered excellently in the following question:

extract text from pdf in Javascript

I would imagine the flow would be something like:

PDF new action
    user uploads PDF 

PDF show action
    PDF is displayed
    JavaScript reads PDF
    JavaScript populates Ryan's raw data
    Raw data is exported with PDF data included 
abbott567
  • Thank you for your response. I have had a look, but it is no help in this matter. My PDF needs to be read and then converted into CSV; what Ryan describes is converting raw data to CSV. – Mr H May 12 '15 at 07:00
  • You didn't say what you had tried, so I wasn't to know that was no help. Perhaps that is why your question was marked down by another user. I have edited my answer with another resource to show how you could read the PDF and populate the export data =) I hope this helps – abbott567 May 12 '15 at 07:06
  • OK, if you look at the sample table that I have attached, when the script returns it as text I get each row with `\n` at the end. Then, when I convert it into CSV, I get all of the fields in one line, and the next line has all of its fields. I will generate it and put it in the question now. – Mr H May 12 '15 at 07:18