Search for a pattern that identifies a page number, header, or footer. For example, when I used pdftotext to convert a PDF file to text, I noticed that the page numbers stood alone on their own lines, so I used a regular expression to remove them like this:
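    import os
    import re

    # src is the directory that holds the converted .txt files
    for root, dirs, files in os.walk(src, topdown=False):
        for name in files:
            if name.endswith('.txt'):
                with open(os.path.join(root, name), "r") as fin:
                    data = fin.read()
                # Pass flags by keyword: the 4th positional argument of re.sub
                # is count, so a bare re.DOTALL there would cap the replacements.
                new_text = re.sub(r'\n\d+\n\s', '', data, flags=re.DOTALL)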
This works because every page number was on a line by itself (with no other text) and was followed by a new line. I did the same for the header and footer of the PDF file.
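As a rough, self-contained sketch of the same idea, the snippet below removes digit-only lines and a repeated header line from a string. It is a variant rather than the exact pattern above: it uses `re.MULTILINE` anchors, and the helper names (`strip_page_numbers`, `strip_repeated_line`) and the sample header text "My Report" are made up for illustration. Note that it would also remove any legitimate line that consists only of digits, so check it against your own files first.

```python
import re

# Matches a whole line that holds only digits (plus optional spaces/tabs).
PAGE_NUMBER_LINE = re.compile(r'^[ \t]*\d+[ \t]*\n', flags=re.MULTILINE)

def strip_page_numbers(text):
    """Remove lines that contain nothing but a number."""
    return PAGE_NUMBER_LINE.sub('', text)

def strip_repeated_line(text, line):
    """Remove every line that exactly matches `line`, e.g. a running header."""
    pattern = re.compile(r'^' + re.escape(line) + r'[ \t]*\n', flags=re.MULTILINE)
    return pattern.sub('', text)

# Assumed sample text: a header and a page number repeated on each page.
sample = "My Report\nFirst page text.\n1\nMy Report\nSecond page text.\n2\n"
cleaned = strip_repeated_line(strip_page_numbers(sample), "My Report")
print(cleaned)
```

Compiling the patterns once and anchoring on line starts keeps numbers inside ordinary sentences untouched; only lines that consist entirely of the number or the header are dropped.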