Remove a page number, header and footer from pdf file

Question

I want to parse a PDF file. For that I am using the pdftotext utility, which converts the PDF file into a text file. Now I want to remove the page numbers, headers and footers from the text file.

I am converting the PDF file with the following command:

pdftotext -layout input.pdf output.txt

Can anyone help me on this?

score 14 · Answer 1 · answered Jan 26 '16 at 01:00


You need to crop the page with the parameters -x, -y, -W and -H; at least -y, -W and -H are required.

Example:

pdftotext -y 80 -H 650 -W 1000 -nopgbrk -eol unix example.pdf


-y 80   -> start the crop area 80 pixels below the top of each page (removes the header);
-H 650  -> height of the crop area in pixels, measured from -y (removes the footer);
-W 1000 -> width of the crop area; a high value so that nothing is cropped horizontally (some value must be specified);

You need to adjust -y and -H for each PDF, sometimes reducing -y and increasing -H, to fit the header and footer.
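
If you have many files, the arithmetic behind -y and -H can be scripted. The following is only a sketch: `crop_args` and `pdf_to_text` are hypothetical helper names, and the defaults assume a US Letter page (792 pixels tall at pdftotext's default 72 DPI) with an 80-pixel header and a 62-pixel footer, which happens to reproduce the -y 80 -H 650 values above. Measure your own PDF before relying on these numbers.

```python
import subprocess


def crop_args(page_height, header, footer, width=1000):
    """Build pdftotext crop options that skip the header and footer bands.

    All values are in pixels at pdftotext's resolution (72 DPI by default).
    """
    crop_height = page_height - header - footer
    if crop_height <= 0:
        raise ValueError("header and footer leave nothing to extract")
    return ["-x", "0", "-y", str(header), "-W", str(width), "-H", str(crop_height)]


def pdf_to_text(pdf_path, txt_path, page_height=792, header=80, footer=62):
    """Run pdftotext on pdf_path with the header and footer cropped away."""
    # page_height, header and footer are assumed values, not read from the PDF.
    cmd = ["pdftotext", "-layout", "-nopgbrk",
           *crop_args(page_height, header, footer), pdf_path, txt_path]
    subprocess.run(cmd, check=True)
```

For example, `pdf_to_text("example.pdf", "example.txt")` would run pdftotext with `-x 0 -y 80 -W 1000 -H 650`.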


Reinaldo Gil


How to count number of pixels? – TatianaP Feb 28 '18 at 07:36
@TatianaP The default setting is 72 DPI (dots per inch), so you could measure in inches and multiply by 72. – Andrew Jul 13 '19 at 20:57
any idea how to use this if you're on windows 10? – Raghav Gupta May 13 '21 at 10:01
@RaghavGupta https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows – Reinaldo Gil May 14 '21 at 12:32
@ReinaldoGil I have checked all the links regarding that. Unfortunately, that question is more focused on downloading pdftotext, which I already have, and some of the solutions mentioned there no longer work. I found a solution using `pdfplumber`, which is a much better utility and allows full control over pages – Raghav Gupta May 14 '21 at 17:34
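
The inches-to-pixels rule from the comments can be written down as a small helper. This is a sketch; `inches_to_pixels` is a hypothetical name. The `dpi` argument is only needed if you change pdftotext's resolution with its -r option; otherwise the default of 72 applies.

```python
def inches_to_pixels(inches, dpi=72):
    """Convert a length measured on the page to pdftotext pixels."""
    return round(inches * dpi)


# A header measured as 1.1 inches tall would give the value to pass to -y.
header_pixels = inches_to_pixels(1.1)
```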

score 0 · Answer 2 · answered Apr 10 '15 at 14:28

Search for a pattern that identifies the page number, header, or footer. For example, when I used pdftotext to convert a PDF file to text, I noticed that the page numbers stood alone on their own lines, so I used a regular expression to remove them:

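import os
import re

src = "."  # placeholder: directory that holds the converted .txt files
for root, dirs, files in os.walk(src, topdown=False):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), "r") as fin:
                data = fin.read()
            # A line holding only digits is treated as a page number.
            # No flags are needed; a 4th positional argument would be read as count.
            new_text = re.sub(r'\n\d+\n\s', '', data)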

This works because every page number was on a line by itself (with no other text), followed by a newline. I did the same for the header and footer of the PDF file.
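
Here is a slightly more general sketch of the same idea that works line by line instead of with one substitution. `strip_page_furniture` and `repeated_lines` are hypothetical names, and it assumes the header and footer text is known in advance and repeats exactly on every page. Note that it also drops any genuine content line that consists only of digits.

```python
import re

# A line that contains only digits, optionally padded with spaces,
# tabs, or the form feed that pdftotext emits between pages.
PAGE_NUMBER = re.compile(r"^[ \t\f]*\d+[ \t]*$")


def strip_page_furniture(text, repeated_lines=()):
    """Drop standalone page-number lines and known header/footer lines."""
    unwanted = {line.strip() for line in repeated_lines}
    kept = []
    for line in text.split("\n"):
        if PAGE_NUMBER.match(line):
            continue  # page number on its own line
        if line.strip() in unwanted:
            continue  # exact header or footer text
        kept.append(line)
    return "\n".join(kept)
```

You would call it as `strip_page_furniture(data, ["My Report"])`, where "My Report" stands in for whatever header text your PDF repeats.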
