PdfParser is a standalone PHP library that provides various tools to extract data from a PDF file.
Questions tagged [pdfparser]
39 questions
5
votes
0 answers
Read pdf page one at a time - Pdf.js
I am trying to parse a PDF with more than 300 pages, using the pdf-parse npm package.
The PDF has 300 pages, but my application crashes while parsing it.
My question: is there a way to parse one page at a time?
Below is the…
user10090131
4
votes
3 answers
Read specific value based on label name from PDF in C#
I have an ASP.NET Core 2.0 C# application which reads/parses a PDF file and gets the text. In it, I want to read a specific value that has a specific label name. In the image below, I want to get the value 171857, which is the invoice number, and…

prog1011
3
votes
2 answers
Why do the PDF parsing libraries pdf2json and pdf-parse seem not to work with the Next.js App Router?
I've been trying to implement PDF parsing logic in my Next.js app. It seems the libraries pdf2json and pdf-parse don't work with the new Next.js App Router.
Steps to reproduce:
Run npx create-next-app@latest and follow the prompts, and say Yes to…

Andrew Luo
3
votes
2 answers
Arabic pdf text extraction
I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools, including Python packages, PDFBox, the Adobe API, and many others, and all of them failed to extract the text correctly, either…

B.A
1
vote
0 answers
Extracting specific data via coordinates using php pdfParser
I want to extract specific data from various pdfs that are 3-4 pages each.
I don't want to parse everything (all the text of each page) and then use, for example, regular expressions to match the data that I want.
So I was looking the…

ThunderBoy
1
vote
1 answer
Issue using the Apache Tika parser when parsing a PDF with text that contains an image
I am using these two dependencies:
tika core 2.6.0
tika parser standard package 2.6.0
Parsing works fine for these cases:
pdf file with text.
pdf file with images.
text files and other extensions.
Parsing is failing with pdfparser runtime…

DeadPool
1
vote
0 answers
I get an error when I use the parseFile function with pdfparser
I want to parse a file with https://github.com/smalot/pdfparser
The problem
When I use $parser->parseFile($pathToPdf) I get this:
Argument 1 passed to Smalot\PdfParser\Parser::parseHeader() must be of the type array, string given, called in…

LocDog
1
vote
1 answer
How to decode PDF file and encode it back?
My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement, namely that some glyph mappings map to 0, which they should not.
My usual strategy was to use an old software called "Pdfedit"…

Smogshaik
1
vote
0 answers
Getting the page size from an uploaded PDF file's metadata in my PHP code
Here I used a PDF parser PHP library:
parseFile('ss.pdf');
// Retrieve all details…

Steven Ragy
1
vote
1 answer
Parsing PDF and getting the header portion information
I am trying to parse the contents of PDFs, which are scientific research papers.
Here's the portion I am trying to grab:
I only need the paper title and the author name(s).
What I used is the PDF Parser Library. And I was able to get the…

Akhilesh B Chandran
1
vote
0 answers
Getting the same junk when extracting Hindi/Devanagari text from a PDF with pdftotext or pdfparser
I am using PHP PdfParser and pdftotext to extract Hindi/Devanagari text from a PDF, but I get the same kind of junk from both.
Junk, for example:
f{kfrt114; rhanz feJ dk tUe lu~ 1977 esa v;ksè;k (mÙkj…

KJA
1
vote
1 answer
pdfparser from pdfminer: PDFException: PDFDocument is not initialized
I don't understand this error. I want to open a PDF and loop over the pages, but I'm getting this exception and couldn't find much by googling it.
Here is the example that fails:
from pdfminer.pdfparser import PDFParser, PDFDocument
from os.path…

Atirag
1
vote
0 answers
Getting an empty combo box value from a PDF file in Express.js
I'm getting an empty combo box value from a PDF file using the 'pdf2json' parser in Express.js. The PDF shows the different options inside the combo box and stores the selection state when the file is saved, but when I try to parse…

jasmeetsohal
1
vote
0 answers
TCPDF_PARSER ERROR: Invalid object reference: Array
I'm using the PdfParser library (https://github.com/smalot/pdfparser) to convert a PDF file to text.
When I convert a file on a local web server, it parses OK. When I convert a file on a remote web server, it fails with the following error:…

Александр Чи
0
votes
1 answer
Reading PDF content in a Next.js 13 API route handler results in a 404
I followed this tutorial (https://www.youtube.com/watch?v=enfZAaTRTKU) on YouTube, which teaches how to upload a PDF file to an Express server and read out its content.
I do not want to display the pdf - I only care about the text.
I have…

frankBang