Returning formatted text from GCP Vision PDF results

Question

I finally got my script to upload a PDF document to Google Cloud Storage and then extract the text using Google Vision for PDF, as described in the documentation.

The data is returned in a huge JSON file. There's one node that contains the text, but it's no longer formatted: only line breaks are marked, with \n. I care less about the line breaks than about paragraphs.

How can I return it formatted? Are there any libraries that would work with GCP to enhance JSON output?
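One possible approach, sketched here under assumptions: besides the flat text node, the Vision output JSON normally also carries a hierarchy of pages, blocks, paragraphs, words and symbols under `fullTextAnnotation`, with a `detectedBreak` type on the symbols that end a word or line. Walking that hierarchy lets you rebuild paragraphs yourself. The function names below are made up for illustration, and the field names assume the usual JSON serialization of the output files.

```python
import json

# Break types (as strings in the JSON output) that should become a space.
# A "HYPHEN" break marks a word split at the end of a line, so no space is added.
SPACE_BREAKS = {"SPACE", "SURE_SPACE", "EOL_SURE_SPACE", "LINE_BREAK"}


def paragraph_text(paragraph):
    """Join the symbols of one paragraph into a single line of text."""
    parts = []
    for word in paragraph.get("words", []):
        for symbol in word.get("symbols", []):
            parts.append(symbol.get("text", ""))
            break_type = (
                symbol.get("property", {}).get("detectedBreak", {}).get("type")
            )
            if break_type in SPACE_BREAKS:
                parts.append(" ")
    return "".join(parts).strip()


def document_text(output_json):
    """Return all paragraphs of one parsed output file, separated by blank lines."""
    paragraphs = []
    for response in output_json.get("responses", []):
        annotation = response.get("fullTextAnnotation", {})
        for page in annotation.get("pages", []):
            for block in page.get("blocks", []):
                for paragraph in block.get("paragraphs", []):
                    text = paragraph_text(paragraph)
                    if text:
                        paragraphs.append(text)
    return "\n\n".join(paragraphs)


def files_text(paths):
    """Combine several output files (hypothetical local paths) into one string."""
    chunks = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            chunks.append(document_text(json.load(handle)))
    return "\n\n".join(chunk for chunk in chunks if chunk)
```

Calling `files_text` with the downloaded output files in page order would give the whole document as one string, with line breaks inside a paragraph collapsed to spaces and a blank line between paragraphs.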

Is feeding the result through jq an option? See https://stedolan.github.io/jq/ . Also see this: https://stackoverflow.com/questions/36728347/cloud-vision-api-pdf-ocr . — , May 26 '19 at 23:36
Looks interesting. But the JSON file I get from GCP is huge; I can't even try it out on jq play online... — santa, May 26 '19 at 23:47
I would --if possible-- pipe it through jq locally. That way filesize doesn't really play much of a role. Example in nearly its simplest form: `cat foo.json | jq .` (note the dot). — , May 27 '19 at 00:29
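For anyone without jq at hand, the same local inspection can be sketched in Python. This assumes the output file has a top-level `responses` list whose entries carry the text node at `fullTextAnnotation.text`; the function names are only illustrative.

```python
import json


def full_text_per_page(document):
    """Return the plain-text node of every response in one parsed output file."""
    return [
        response.get("fullTextAnnotation", {}).get("text", "")
        for response in document.get("responses", [])
    ]


def pretty(document, indent=2):
    """Re-serialize the parsed JSON with indentation, roughly like `jq .` does."""
    return json.dumps(document, indent=indent, ensure_ascii=False)
```

Loading the file with `json.load` and passing the result to `full_text_per_page` avoids scrolling through the whole structure just to find the text.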
What do you mean by 'formatted'? What is your current output, and what would you like the output to look like? Please elaborate. — Tom, May 28 '19 at 08:59
I don't care so much about line breaks, but I would like to preserve new lines and paragraphs. I definitely want the entire document in one output rather than separate files, with headers and footers removed. Here's a link to the test file I was working with: https://docdro.id/NyFyxJq — santa, May 28 '19 at 19:13
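As a rough illustration of the header and footer part: one heuristic is to drop any block whose bounding box lies entirely inside the top or bottom margin of the page. The sketch below assumes each block has a `boundingBox` with `normalizedVertices` (coordinates as fractions of the page, with zero values possibly omitted from the JSON); the 8% margin and the function names are arbitrary choices, not anything provided by the API.

```python
def vertical_extent(block):
    """Return (top, bottom) of a block as fractions of page height, or None."""
    vertices = block.get("boundingBox", {}).get("normalizedVertices", [])
    if not vertices:
        return None
    # A coordinate equal to zero may be missing from the JSON, so default to 0.0.
    ys = [vertex.get("y", 0.0) for vertex in vertices]
    return min(ys), max(ys)


def is_margin_block(block, margin=0.08):
    """True if the block sits entirely in the top or bottom margin band."""
    extent = vertical_extent(block)
    if extent is None:
        return False  # no position information: keep the block
    top, bottom = extent
    return bottom <= margin or top >= 1.0 - margin


def body_blocks(page, margin=0.08):
    """Return the blocks of a page that are not treated as header or footer."""
    return [
        block for block in page.get("blocks", [])
        if not is_margin_block(block, margin)
    ]
```

The margin would need tuning per document, and a short first or last body line that happens to sit in the band would be dropped too, so this is a starting point rather than a reliable filter.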
Did you try using MS Word instead? You open PDF in Word, save to xml, and retrieve data from xml file. — RobertBaron, Jun 02 '19 at 11:15
I am not familiar with PHP, but Word can be automated from code in several languages. Is your code run by a web server or by some interactive application? I am asking because attempting to run Word on a server without a desktop will not work. So, if automating Word is possible in your case, I suggest that you do a manual test first: open a PDF in Word and save it as XML. Not every PDF yields XML from which table data is easy to extract. — RobertBaron, Jun 03 '19 at 16:54

0 Answers