Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility that converts PDF files to plain text files, i.e. it extracts the raw text from PDF documents.

pdftotext is freely available and included by default with many Linux distributions; it is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, ships its own implementation of pdftotext in the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they work in much the same way for the most part, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
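As a rough illustration of the kind of invocation this tag covers, and of how to collect the version details asked for above, here is a minimal Python sketch. The helper names build_pdftotext_cmd and pdftotext_version are hypothetical (they are not part of any pdftotext package), and the sketch assumes a pdftotext binary from Xpdf or Poppler may be on the PATH. It uses only the -layout, -enc and -v options and the "-" output argument; exact behavior can still differ between variants.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, out_path="-", layout=True, encoding="UTF-8"):
    """Build an argument list for the pdftotext CLI (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    if encoding:
        cmd += ["-enc", encoding]  # encoding of the extracted text
    cmd += [pdf_path, out_path]  # "-" as the output file means stdout
    return cmd


def pdftotext_version():
    """Return the version banner, or None if pdftotext is not on the PATH."""
    if shutil.which("pdftotext") is None:
        return None
    # -v prints version information; it typically goes to stderr.
    result = subprocess.run(["pdftotext", "-v"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()
```

The list returned by build_pdftotext_cmd("sample.pdf") can be passed to subprocess.run, and the string from pdftotext_version() is the sort of detail worth pasting into a question so that readers know which variant produced the output.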

367 questions


7 answers

CLI pdf viewer for linux

Hey, for quite a while now, I am looking for a pdf viewer for the command line. As I like to work without X on Linux, and often work on a remote machine, I would like to have a tool to read pdfs. There are quite a lot of really good graphical…

linux pdf command-line ncurses pdftotext

asked Aug 25 '10 at 22:03

bitmask


4 answers

How to wait for a stream to finish piping? (Nodejs)

I have a for loop array of promises, so I used Promise.all to go through them and called then afterwards. let promises = []; promises.push(promise1); promises.push(promise2); promises.push(promise3); Promise.all(promises).then((responses) => { …

node.js asynchronous promise pipe pdftotext

asked Jun 15 '16 at 13:38

ThePumpkinMaster


7 answers

Unable to install pdftotext on Python 3.6, missing poppler

How can I install pdftotext properly? I'm getting the error message below when installing pdftotext in Python 3.6. I also tried to install the package manually by downloading the zip file but still got the same error. pdftotext/pdftotext.cpp(4):…

python installation pdftotext

asked Aug 28 '17 at 06:08

mtryingtocode


7 answers

How to extract table data from PDF as CSV from the command line?

I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices. pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \ | sed '$d' …

pdf grep pdftotext

asked May 18 '15 at 18:28

user706838


2 answers

Use R to convert PDF files to text files for text mining

I have nearly one thousand pdf journal articles in a folder. I need to text mine on all article's abstracts from the whole folder. Now I am doing the following: dest <- "~/A1.pdf" # set path to pdftotxt.exe and convert pdf to text exe <-…

r text-mining tm pdftotext

asked Jan 30 '14 at 00:33

S Das


5 answers

PDF to Text extractor in nodejs without OS dependencies

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command. Thanks

node.js pdf pdftotext

asked Jun 09 '15 at 13:38

bartium


2 answers

How to save text file in UTF-8 format using pdftotext

I am using pdftotext opensource tool to convert the PDF to text files. How can I save the text files in UTF-8 format so that I can retain all the accent characters in text files. I am using the below command to convert which extracts the content to…

utf-8 pdftotext

asked Oct 28 '10 at 05:07

Amar


2 answers

Extract table data from PDF

Is there any consistent way to extract tables from PDF files? Any tools? What I have done so far: I have tried out pdftotext tool. It has an option to convert to HTML layout. What is the problem with this: The table information is not preserved…

pdf pdftotext pdf-to-html

asked May 06 '14 at 12:56

Rajneesh


2 answers

Using two commands (using pipe |) with spawn

I'm converting a doc to a pdf (unoconv) in memory and printing (pdftotext) in the terminal with: unoconv -f pdf --stdout sample.doc | pdftotext -layout -enc UTF-8 - out.txt Is working. Now i want use this command with child_process.spawn: let…

node.js child-process spawn pdftotext unoconv

asked Jul 08 '16 at 18:29

user5526811


2 answers

Remove a page number, header and footer from pdf file

I want to parse a pdf file, for that I am using pdftotext utility which converts pdf file into text file, now I want to remove a page number, header and footer from text file. I am converting a pdf file using following syntax: pdftotext -layout…

pdftotext

asked Jan 12 '15 at 11:44

Deepti Kakade


3 answers

Extract Text Using PdfMiner and PyPDF2 Merges columns

I am trying to parse the pdf file text using pdfMiner, but the extracted text gets merged. I am using the pdf file from the following link [edit: link was broken / pointed to potential malware] I am good with any type of output (file/string). Here…

python pypdf pdftotext

asked Apr 01 '13 at 04:54

user2151334


2 answers

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as it is with the indentation into a CSV file. Index page from PDF text book: I should split the text into class and subclass type hierarchy along with the page numbers. For example in the image, Application…

python pdfminer pdftotext named-entity-recognition nlp

asked Mar 03 '18 at 18:35

Aryan


6 answers

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error: pdf2txt.py 2.pdf Traceback (most recent call last): File "/usr/local/bin/pdf2txt.py", line 115, in if __name__ == '__main__':…

python pdf pdftotext pdfminer pdf-parsing

asked Oct 20 '16 at 15:28

Danil


1 answer

Extracting data from Invoices in pdf or image format

I am working on an invoice parser which extracts data from invoices in pdf or image format. It works on simple pdf with non tabular data but gives lots of output data to process with pdf which contains tables. I am not able to get a working generic…

parsing ocr invoice pdftotext tabula

asked May 23 '19 at 15:01

Rajesh Gosemath


0 answers

Returning formatted text from GCP Vision PDF results

I finally got my script to submit PDF document to Google Storage and then extract Text using Google Vision for PDF, as described in documentation. The data is returned in a huge JSON file. There's one node that contains test, but it's no longer…

php pdf google-vision pdftotext pdf-to-html

asked May 23 '19 at 00:45

santa
