Extracting Page-level ASCII Text from a Collection of Multi-page PDFs?

Question

I am trying to get page-level ASCII text out of a series of multi-page PDFs. My current process is to split all of the PDFs in batch with Sejda (an awesome tool) and then extract text from the split PDFs (also as a Sejda batch) into corresponding text files. Is there an easy way to bypass the splitting phase and go straight to the page-level TXT files? I would like to input a collection of multi-page PDFs and output a corresponding TXT file for each page of each PDF. Any input or insight would be appreciated.

My process

File.pdf --> File-001.pdf; File-002.pdf; etc. --> File-001.txt; File-002.txt; etc.

Since you mentioned Sejda: the feature you are talking about is planned but not yet implemented. You may want to keep an eye on it [here](https://github.com/torakiki/sejda/issues/85). - Andrea Vacondio, Oct 25 '13 at 07:46

score 1 · Answer 1 · answered Oct 26 '13 at 11:47

Sejda version 1.0.0.M8 has the task that you are looking for: ExtractTextByPages

Example usage from the command line:

bin/sejda-console extracttextbypages -f /tmp/file.pdf -o /tmp -e "UTF-8" --pageNumbers 1 3 5
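
To cover a whole collection of PDFs rather than a single file, the command above can be wrapped in a loop. The following is only a sketch under stated assumptions: the function name `extract_pages`, the `SEJDA` variable, and the one-folder-per-PDF output layout are hypothetical choices made for illustration, and the only flags used are the ones shown in the example above. Whether `--pageNumbers` can be omitted to process every page is not confirmed here, so check the console's own help output before relying on that.

```shell
#!/bin/sh
# Hypothetical wrapper: run the extracttextbypages task once per PDF.
# SEJDA defaults to the path used in the example above; override it
# to match your own installation.
SEJDA="${SEJDA:-bin/sejda-console}"

# extract_pages IN_DIR OUT_DIR PAGE...
# Page numbers are passed through unchanged, as in the example above.
extract_pages() {
    in_dir=$1
    out_dir=$2
    shift 2
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # skip when the glob matches nothing
        name=$(basename "$pdf" .pdf)
        mkdir -p "$out_dir/$name"      # assumed layout: one folder per source PDF
        "$SEJDA" extracttextbypages -f "$pdf" -o "$out_dir/$name" \
            -e "UTF-8" --pageNumbers "$@" || echo "failed: $pdf" >&2
    done
}
```

Called as `extract_pages /tmp/pdfs /tmp/txt 1 3 5`, this would run the task on pages 1, 3 and 5 of every PDF in `/tmp/pdfs`. Keeping a separate output folder per PDF is a design choice to avoid name clashes between files, since the naming of the generated text files is left to Sejda.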

Edi

