
I want to parse a PDF file. For that I am using the pdftotext utility, which converts the PDF into a text file. Now I want to remove the page numbers, headers, and footers from the text file.

I am converting the PDF file with the following command:

pdftotext -layout input.pdf output.txt

Can anyone help me on this?

eugen
Deepti Kakade

2 Answers


You need to crop the page with the parameters -H, -W, -y, and -x, or at least -H, -W, and -y.

Example:

pdftotext -y 80 -H 650 -W 1000 -nopgbrk -eol unix example.pdf


-y 80   -> start the crop area 80 pixels below the top of the page (removes the header);
-H 650  -> keep 650 pixels of height, measured from the -y offset (removes the footer);
-W 1000 -> a width large enough to crop nothing horizontally (some value must be specified);

You need to adjust -y and -H for each PDF, for example by reducing -y and increasing -H, until the crop area fits between the header and the footer.
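To avoid guessing the pixel values, the crop options can be derived from margins measured in inches. The following is a minimal sketch, assuming the default resolution of 72 DPI mentioned in the comments; the helper names `crop_args` and `build_command` and their parameters are hypothetical and only assemble the command line, they are not part of pdftotext.

```python
def crop_args(page_height_in, header_in, footer_in, dpi=72, width_px=1000):
    """Build pdftotext crop options from margins measured in inches.

    Assumes pdftotext runs at `dpi` dots per inch (72 unless -r is given).
    """
    # -y: distance from the top of the page to the top of the crop area.
    y = round(header_in * dpi)
    # -H: height of the crop area, i.e. the page minus header and footer.
    height = round((page_height_in - header_in - footer_in) * dpi)
    if height <= 0:
        raise ValueError("header and footer margins exceed the page height")
    # -W: a generous width so that nothing is cropped horizontally.
    return ["-y", str(y), "-H", str(height), "-W", str(width_px)]


def build_command(pdf_path, txt_path, **margins):
    """Return the argument list for a cropped pdftotext run."""
    return ["pdftotext", *crop_args(**margins),
            "-nopgbrk", "-eol", "unix", pdf_path, txt_path]
```

For an 11-inch-tall page with a 1-inch header and a 1-inch footer, `crop_args(11, 1, 1)` gives `-y 72 -H 648 -W 1000`. The list returned by `build_command` can be passed to `subprocess.run(..., check=True)`; the values will probably still need small adjustments per PDF.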

Reinaldo Gil
  • How do I count the number of pixels? – TatianaP Feb 28 '18 at 07:36
  • @TatianaP The default setting is 72 DPI (dots per inch), so you could measure in inches and multiply by 72. – Andrew Jul 13 '19 at 20:57
  • Any idea how to use this on Windows 10? – Raghav Gupta May 13 '21 at 10:01
  • @RaghavGupta https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows – Reinaldo Gil May 14 '21 at 12:32
  • @ReinaldoGil I have checked all the links regarding that. Unfortunately, that question focuses on downloading pdftotext, which I already have, and some of the solutions mentioned there no longer work. I found a solution using `pdfplumber`, which is a much better utility and allows full control over pages. – Raghav Gupta May 14 '21 at 17:34

Search for a pattern that identifies a page number, header, or footer. For example, when I used pdftotext to convert a PDF file to text, I noticed that the page numbers stood alone on their own lines, so I used a regular expression to remove them like this:

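import os
import re

for root, dirs, files in os.walk(src, topdown=False):  # src: folder with the .txt files
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), "r") as fin:
                data = fin.read()
            # Remove lines that contain only a page number. re.DOTALL is dropped:
            # re.sub's fourth positional argument is count, not flags.
            new_text = re.sub(r'\n\d+\n\s', '', data)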

This works because every page number was on a line of its own (with no other text) and was followed by a new line. I did the same for the header and footer of the PDF file.
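The same idea can be generalised so that headers and footers do not need a hand-written pattern each. The following is a rough sketch, assuming the text was produced without -nopgbrk, so that pdftotext separates pages with a form feed character; the function name `strip_page_furniture` and the `min_share` threshold are hypothetical. It drops the first and last non-blank line of a page when the same line, with digits normalised, recurs at that position on most pages.

```python
import re
from collections import Counter


def _key(line):
    # Normalise digits so "Page 3" and "Page 12" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def _edges(lines):
    # Indexes of the first and last non-blank line, or None for a blank page.
    idx = [i for i, ln in enumerate(lines) if ln.strip()]
    return (idx[0], idx[-1]) if idx else None


def strip_page_furniture(text, min_share=0.6):
    """Remove top/bottom lines that repeat on at least min_share of pages."""
    pages = [page.splitlines() for page in text.split("\f")]
    edges = [_edges(page) for page in pages]
    non_empty = sum(e is not None for e in edges)
    tops = Counter(_key(p[e[0]]) for p, e in zip(pages, edges) if e)
    bottoms = Counter(_key(p[e[1]]) for p, e in zip(pages, edges) if e)
    # A line must recur on at least two pages to count as a header or footer.
    threshold = max(2, min_share * non_empty)

    cleaned = []
    for lines, e in zip(pages, edges):
        if e:
            first, last = e
            drop = set()
            if tops[_key(lines[first])] >= threshold:
                drop.add(first)
            if bottoms[_key(lines[last])] >= threshold:
                drop.add(last)
            lines = [ln for i, ln in enumerate(lines) if i not in drop]
        cleaned.append("\n".join(lines))
    return "\f".join(cleaned)
```

This only looks at one line at the top and one at the bottom of each page, so multi-line headers would need the function applied repeatedly or a wider window. Body text that happens to start most pages with the same line would also be removed, so the output is worth checking.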

bettas