Questions tagged [pdftextstream]

PDFTextStream is a component used for extracting text and metadata from PDF documents.

7 questions
4
votes
1 answer

Java - Text Extraction from PDF using OCR

I have a PDF file (part of it is shown below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (However, it worked with another file that has simple text.) What other OCR libraries are capable of…
Dax Amin
  • 497
  • 2
  • 5
  • 13
3
votes
2 answers

Arabic PDF text extraction

I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR). I tried many packages and tools, including Python packages, PDFBox, the Adobe API, and many others, and all of them failed to extract the text correctly, either…
B.A
  • 45
  • 4
3
votes
1 answer

Searching through PDF text with Node.js

I have thousands of searchable PDFs, some of which are up to 1 GB with over 2,000 pages. I need to be able to search for a text string in these files using a Node.js app. Right now, the files are stored in a Google Cloud Storage bucket. What's the…
markkazanski
  • 439
  • 7
  • 20
1
vote
1 answer

Java - Error while using PDFTextStream

I have a PDF file and want to extract text from it. I am using PDFTextStream. I got this code from its documentation, but it gives an error. import com.snowtide.PDF; import com.snowtide.pdf.Document; import com.snowtide.pdf.OutputTarget; public class…
Dax Amin
  • 497
  • 2
  • 5
  • 13
0
votes
1 answer

I am getting the error Command "python setup.py egg_info" failed

I was doing text identification and extraction from PDFs and needed to install textract for that. However, I am getting this error while installing: Command "python setup.py egg_info" failed with error code 1 in…
Kopal Sharma
  • 69
  • 1
  • 2
  • 10
0
votes
1 answer

How is the value of the tj operator generated in a PDF document (justified text)?

I can't understand or find out how the value of the tj operator is generated. Here I paste the result before and after changing the text alignment (in the second block I changed the position to left-justified and then back to centered). I think…
0
votes
1 answer

How to extract text from PDF using PDFTextStream using Java

Text is not extracted from the Sample.pdf file when using pdftextstream-2.6.3.jar: String filePath = "D:\\inbox\\temp\\Sample.pdf"; File document = new File(filePath); StringBuffer pdfText = new StringBuffer(1024); com.snowtide.pdf.OutputTarget tgt = new…
UdayKiran Pulipati
  • 6,579
  • 7
  • 67
  • 92