Questions tagged [pdftotext]

Pdftotext converts Portable Document Format (PDF) files to plain text.

pdftotext is a command-line utility for converting PDF files to plain text files, i.e. extracting raw text from PDF-encapsulated files.

pdftotext is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext, which ships as part of the poppler-utils package on most major Linux distributions.

However, there are also other CLI-based PDF text extraction tools with a similar or identical name. While they work in the same way for the most part, they may give different results. So, only use this tag for CLI-based pdftotext tools and variants, and make sure to state your specific version and environment.

Do not use this tag if you use a different extraction tool, e.g. a GUI-based PDF-to-text converter, an online PDF-to-text converter, or another (commercial) tool.
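
As a rough illustration of the usage described above, the following Python sketch builds a typical pdftotext command line and reads the tool's version banner so it can be quoted in a question. The helper names `build_command` and `tool_version` and the file name `sample.pdf` are hypothetical. The `-layout` and `-enc UTF-8` options and the `-` output argument (write to stdout) are the ones that appear in the question excerpts on this page; that `-v` prints a version banner is an assumption that should hold for both the Xpdf and Poppler builds but is worth checking against your own installation.

```python
import shutil
import subprocess


def build_command(pdf_path, txt_path="-", layout=True, encoding="UTF-8"):
    """Return the argv list for a pdftotext call (nothing is executed).

    txt_path="-" asks pdftotext to write the text to stdout.
    """
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += ["-enc", encoding, pdf_path, txt_path]
    return cmd


def tool_version():
    """Return the first line of the version banner, or None if unavailable."""
    if shutil.which("pdftotext") is None:
        return None  # pdftotext is not on PATH
    result = subprocess.run(["pdftotext", "-v"], capture_output=True, text=True)
    # Assumption: the banner goes to stderr; fall back to stdout just in case.
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else None


if __name__ == "__main__":
    print(build_command("sample.pdf"))
    print(tool_version())
```

Passing the list from `build_command` to `subprocess.run` avoids shell quoting problems with file names that contain spaces. Including the output of `tool_version()` in a question makes it clear which pdftotext variant produced the result.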

367 questions
69
votes
7 answers

CLI pdf viewer for linux

Hey, for quite a while now, I am looking for a pdf viewer for the command line. As I like to work without X on Linux, and often work on a remote machine, I would like to have a tool to read pdfs. There are quite a lot of really good graphical…
bitmask
  • 32,434
  • 14
  • 99
  • 159
63
votes
4 answers

How to wait for a stream to finish piping? (Nodejs)

I have a for loop array of promises, so I used Promise.all to go through them and called then afterwards. let promises = []; promises.push(promise1); promises.push(promise2); promises.push(promise3); Promise.all(promises).then((responses) => { …
ThePumpkinMaster
  • 2,181
  • 5
  • 22
  • 31
35
votes
7 answers

Unable to install pdftotext on Python 3.6, missing poppler

How can I install pdftotext properly? I'm getting the error message below when installing pdftotext in Python 3.6. I also tried to install the package manually by downloading the zip file but still got the same error. pdftotext/pdftotext.cpp(4):…
mtryingtocode
  • 939
  • 3
  • 13
  • 26
25
votes
7 answers

How to extract table data from PDF as CSV from the command line?

I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices. pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \ | sed '$d' …
user706838
  • 5,132
  • 14
  • 54
  • 78
21
votes
2 answers

Use R to convert PDF files to text files for text mining

I have nearly one thousand pdf journal articles in a folder. I need to text mine on all article's abstracts from the whole folder. Now I am doing the following: dest <- "~/A1.pdf" # set path to pdftotxt.exe and convert pdf to text exe <-…
S Das
  • 3,291
  • 6
  • 26
  • 41
14
votes
5 answers

PDF to Text extractor in nodejs without OS dependencies

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command. Thanks
bartium
  • 315
  • 1
  • 5
  • 8
12
votes
2 answers

How to save text file in UTF-8 format using pdftotext

I am using pdftotext opensource tool to convert the PDF to text files. How can I save the text files in UTF-8 format so that I can retain all the accent characters in text files. I am using the below command to convert which extracts the content to…
Amar
  • 257
  • 2
  • 6
  • 14
12
votes
2 answers

Extract table data from PDF

Is there any consistent way to extract tables from PDF files? Any tools? What I have done so far: I have tried out pdftotext tool. It has an option to convert to HTML layout. What is the problem with this: The table information is not preserved…
Rajneesh
  • 2,185
  • 4
  • 20
  • 30
11
votes
2 answers

Using two commands (using pipe |) with spawn

I'm converting a doc to a pdf (unoconv) in memory and printing (pdftotext) in the terminal with: unoconv -f pdf --stdout sample.doc | pdftotext -layout -enc UTF-8 - out.txt This works. Now I want to use this command with child_process.spawn: let…
user5526811
10
votes
2 answers

Remove a page number, header and footer from pdf file

I want to parse a pdf file, for that I am using pdftotext utility which converts pdf file into text file, now I want to remove a page number, header and footer from text file. I am converting a pdf file using following syntax: pdftotext -layout…
Deepti Kakade
  • 3,053
  • 3
  • 19
  • 30
9
votes
3 answers

Extract Text Using PdfMiner and PyPDF2 Merges columns

I am trying to parse the pdf file text using pdfMiner, but the extracted text gets merged. I am using the pdf file from the following link [edit: link was broken / pointed to potential malware] I am good with any type of output (file/string). Here…
user2151334
  • 101
  • 1
  • 1
  • 3
8
votes
2 answers

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as it is with the indentation into a CSV file. Index page from PDF text book: I should split the text into class and subclass type hierarchy along with the page numbers. For example in the image, Application…
Aryan
  • 81
  • 1
  • 5
7
votes
6 answers

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error: pdf2txt.py 2.pdf Traceback (most recent call last): File "/usr/local/bin/pdf2txt.py", line 115, in if __name__ == '__main__':…
Danil
  • 4,781
  • 1
  • 35
  • 50
6
votes
1 answer

Extracting data from Invoices in pdf or image format

I am working on invoice parser which extracts data from invoices in pdf or image format. It works on simple pdf with non tabular data but gives lots of output data to process with pdf which contains tables. I am not able to get a working generic…
Rajesh Gosemath
  • 1,812
  • 1
  • 17
  • 31
6
votes
0 answers

Returning formatted text from GCP Vision PDF results

I finally got my script to submit PDF document to Google Storage and then extract Text using Google Vision for PDF, as described in documentation. The data is returned in a huge JSON file. There's one node that contains test, but it's no longer…
santa
  • 12,234
  • 49
  • 155
  • 255