
I finally got my script to submit a PDF document to Google Cloud Storage and then extract the text using Google Vision for PDF, as described in the documentation.

The data is returned in a huge JSON file. There's one node that contains the text, but it's no longer formatted. Only line breaks are delineated, with \n. I don't care so much about the line breaks as about the paragraphs.

How can I return it formatted? Are there any libraries that would work with GCP to enhance JSON output?
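
One possible approach, sketched in Python: instead of reading the flat text node, walk the structured part of the output. This assumes the output JSON has a top-level `responses` list whose entries carry a `fullTextAnnotation` with a `pages` / `blocks` / `paragraphs` / `words` / `symbols` hierarchy, and that breaks are marked in `property.detectedBreak.type` on the symbol that precedes them. The function names and the sample data are made up for illustration, and other break types (such as `HYPHEN`) are ignored here.

```python
import json

# Break types found in symbol["property"]["detectedBreak"]["type"]
SPACE_BREAKS = {"SPACE", "SURE_SPACE"}
LINE_BREAKS = {"EOL_SURE_SPACE", "LINE_BREAK"}


def paragraph_text(paragraph, keep_line_breaks=False):
    """Rebuild one paragraph's text from its words and symbols."""
    out = []
    for word in paragraph.get("words", []):
        for symbol in word.get("symbols", []):
            out.append(symbol.get("text", ""))
            brk = symbol.get("property", {}).get("detectedBreak", {}).get("type")
            if brk in SPACE_BREAKS:
                out.append(" ")
            elif brk in LINE_BREAKS:
                # Inside a paragraph, a line break is only a visual wrap
                out.append("\n" if keep_line_breaks else " ")
    return "".join(out)


def extract_paragraphs(vision_output):
    """Return every non-empty paragraph in reading order."""
    paragraphs = []
    for response in vision_output.get("responses", []):
        annotation = response.get("fullTextAnnotation", {})
        for page in annotation.get("pages", []):
            for block in page.get("blocks", []):
                for paragraph in block.get("paragraphs", []):
                    text = paragraph_text(paragraph).strip()
                    if text:
                        paragraphs.append(text)
    return paragraphs


def load_paragraphs(path):
    """Read one output file from disk (the path is whatever you downloaded)."""
    with open(path, encoding="utf-8") as handle:
        return extract_paragraphs(json.load(handle))


def _sym(text, brk=None):
    """Build a tiny fake symbol for the demo below."""
    symbol = {"text": text}
    if brk:
        symbol["property"] = {"detectedBreak": {"type": brk}}
    return symbol


# Hand-written sample shaped like the assumed structure, not real API output
sample = {"responses": [{"fullTextAnnotation": {"pages": [{"blocks": [{"paragraphs": [
    {"words": [
        {"symbols": [_sym("H"), _sym("i", "SPACE")]},
        {"symbols": [_sym("a"), _sym("l"), _sym("l", "LINE_BREAK")]},
    ]},
    {"words": [{"symbols": [_sym("B"), _sym("y"), _sym("e", "LINE_BREAK")]}]},
]}]}]}}]}

print("\n\n".join(extract_paragraphs(sample)))
```

Joining the paragraphs with a blank line keeps paragraph boundaries while dropping the wrapped line breaks inside each one. Pass `keep_line_breaks=True` to `paragraph_text` if the original line wrapping matters.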

santa
  • Is feeding the result through jq an option? See https://stedolan.github.io/jq/ . Also see this: https://stackoverflow.com/questions/36728347/cloud-vision-api-pdf-ocr . –  May 26 '19 at 23:36
  • Looks interesting. But the JSON file I get from GCP is huge, so I can't even try it out on jq play online... – santa May 26 '19 at 23:47
  • I would --if possible-- pipe it through jq locally. That way filesize doesn't really play much of a role. Example in nearly its simplest form: `cat foo.json | jq .` (note the dot). –  May 27 '19 at 00:29
  • What do you mean by 'formatted'? What does your current output look like, and what would you like it to look like? Please elaborate. – Tom May 28 '19 at 08:59
  • I don't care so much about line breaks, but I would like to preserve new lines and paragraphs. I definitely want the entire document in one output, not in separate files, and I want the headers and footers removed. Here's a link to the test file I was working with: https://docdro.id/NyFyxJq – santa May 28 '19 at 19:13
  • Did you try using MS Word instead? You can open the PDF in Word, save it as XML, and retrieve the data from the XML file. – RobertBaron Jun 02 '19 at 11:15
  • @RobertBaron I'm trying to write a script to do this work. – santa Jun 03 '19 at 15:16
  • I am not familiar with PHP, but Word can be automated from code in several languages. Is your code run by a web server or by some interactive application? I am asking because attempting to run Word on a server without a desktop will not work. So, if automating Word is possible in your case, I suggest that you do a manual test first: open a PDF in Word and save it as XML. Not all PDFs yield XML from which table data is easy to extract. – RobertBaron Jun 03 '19 at 16:54
  • How many output files are there for your 4-page PDF? – Brendan Jun 06 '19 at 02:12
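
For the goal stated in the comments above (one combined output, with headers and footers removed), a rough heuristic can be sketched in Python. It assumes the text has already been split into a list of paragraph strings per page, and it treats the first or last paragraph of a page as a header or footer when the same text, ignoring digits such as page numbers, sits in that position on most pages. The function names, the `min_share` threshold and the sample pages are all made up for illustration. This is not something the Vision API provides.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse digits so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last paragraphs that recur on at least min_share of pages."""
    if len(pages) < 2:
        # Nothing to compare against, so leave a single page untouched
        return [list(page) for page in pages]
    first = Counter(normalize(page[0]) for page in pages if page)
    last = Counter(normalize(page[-1]) for page in pages if page)
    threshold = min_share * len(pages)
    cleaned = []
    for page in pages:
        page = list(page)
        if page and first[normalize(page[0])] >= threshold:
            page = page[1:]
        if page and last[normalize(page[-1])] >= threshold:
            page = page[:-1]
        cleaned.append(page)
    return cleaned


def merge_pages(pages):
    """Join all remaining paragraphs into one document string."""
    return "\n\n".join(text for page in pages for text in page)


# Hand-written sample: three pages with a shared header and numbered footer
pages = [
    ["ACME Report", "Intro text.", "Page 1"],
    ["ACME Report", "More text.", "Page 2"],
    ["ACME Report", "Final text.", "Page 3"],
]

document = merge_pages(strip_repeated_edges(pages))
print(document)
```

If the PDF produced several output files, the per-page lists from each file can be concatenated in order before calling `strip_repeated_edges`. On short documents the threshold may need tuning, since a body paragraph that happens to repeat at a page edge would also be dropped.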

0 Answers