Questions tagged [pdfparser]

A standalone PHP library that provides various tools to extract data from a PDF file.

See https://github.com/smalot/pdfparser

39 questions
5
votes
0 answers

Read pdf page one at a time - Pdf.js

I am trying to parse a PDF with more than 300 pages using the pdf-parse npm package, but my application crashes while parsing it. Is there a way to parse one page at a time? Below is the…
user10090131
4
votes
3 answers

Read specific value based on label name from PDF in C#

I have an ASP.NET Core 2.0 C# application that reads and parses a PDF file to get its text. From this text I want to read a specific value that has a specific label name. In the image below, I want to get the value 171857, which is the invoice number, and…
prog1011
3
votes
2 answers

Why do the PDF parsing libraries pdf2json and pdf-parse seem not to work with the Next.js App Router?

I've been trying to implement PDF parsing logic in my Next.js app. It seems the libraries pdf2json and pdf-parse don't work with the new Next.js App Router. Steps to reproduce: Run npx create-next-app@latest and follow the prompts, and say Yes to…
Andrew Luo
3
votes
2 answers

Arabic PDF text extraction

I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR). I tried many packages and tools, including Python packages, PDFBox, and the Adobe API, and all of them failed to extract the text correctly, either…
B.A
1
vote
0 answers

Extracting specific data via coordinates using PHP PdfParser

I want to extract specific data from various PDFs that are 3-4 pages each. I don't want to parse everything (all the text of each page) and then use, for example, regular expressions to match the data that I want. So I was looking at the…
ThunderBoy
1
vote
1 answer

Issue using the Apache Tika parser when parsing a PDF whose text contains an image

I am using these two dependencies: tika core 2.6.0 and tika parser standard package 2.6.0. Parsing works fine for these cases: PDF files with text, PDF files with images, text files, and other extensions. Parsing fails with a pdfparser runtime…
DeadPool
1
vote
0 answers

I get an error when I use the parseFile function with pdfparser

I want to parse a file with https://github.com/smalot/pdfparser. The problem: when I use $parser->parseFile($pathToPdf), I get this: Argument 1 passed to Smalot\PdfParser\Parser::parseHeader() must be of the type array, string given, called in…
LocDog
1
vote
1 answer

How to decode PDF file and encode it back?

My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement: some glyph mappings map to 0, which they should not. My usual strategy was to use an old program called "Pdfedit"…
Smogshaik
1
vote
0 answers

Getting the page size from an uploaded PDF file's metadata in my PHP code

Here I used a PDF parser PHP library: parseFile('ss.pdf'); // Retrieve all details…
1
vote
1 answer

Parsing PDF and getting the header portion information

I am trying to parse the contents of PDFs, which are scientific research papers. Here's the portion I am trying to grab: I only need the paper title and the author name(s). I used the PDF Parser library, and I was able to get the…
Akhilesh B Chandran
1
vote
0 answers

Getting the same junk when extracting Hindi/Devanagari text from a PDF with pdftotext or pdfparser

I am using PHP PdfParser and pdftotext to extract Hindi/Devanagari text from a PDF, but I get the same kind of junk from both. Junk, for example: f{kfrt114; rhanz feJ dk tUe lu~ 1977 esa v;ksè;k (mÙkj…
KJA
1
vote
1 answer

pdfparser from pdfminer: PDFException: PDFDocument is not initialized

I don't understand this error. I want to open a PDF and loop over the pages, but I'm getting this exception and couldn't find much by googling it. Here is the example that fails: from pdfminer.pdfparser import PDFParser, PDFDocument from os.path…
Atirag
1
vote
0 answers

Getting an empty combo box value from a PDF file in Express.js

I'm getting an empty combo box value from a PDF file using the 'pdf2json' parser in Express.js. The PDF file shows the different options inside the combo box and also stores the state of the selection when the file is saved, but when I try to parse…
jasmeetsohal
1
vote
0 answers

TCPDF_PARSER ERROR: Invalid object reference: Array

I'm using the PdfParser library (https://github.com/smalot/pdfparser) to convert a PDF file to text. When I convert a file on a local web server, it parses OK. When I convert a file on a remote web server, it fails with the following error:…
0
votes
1 answer

Reading PDF content in a Next.js 13 API route handler results in 404

I followed this tutorial (https://www.youtube.com/watch?v=enfZAaTRTKU) on YouTube, which teaches how to upload a PDF file to an Express server and read out its content. I do not want to display the PDF; I only care about the text. I have…
frankBang