I need to build a database from several pieces of data. Most of that data is contained in PDF files. The PDF files all share the same layout and differ only in the data they contain (for example, one of the files I have to work with: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e).
I have tried to extract the data with PyPDF, tabula, and pdfminer (I also tried textract, but it didn't work through Anaconda), among other tools, but I didn't get what I want.
Then I tried converting the PDF files to txt files and mining those, but didn't get anything. I also tried regex, but I don't understand how to use it, although the code doesn't show errors when running:
    import re

    recording = False
    # Raw string so the backslashes in the Windows path are not treated as escapes
    your_file = r"D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
    start_pattern = r'apellidos:'
    stop_pattern = r'1\.2'  # escape the dot so it matches a literal "."
    output_section = []

    with open(your_file) as f:
        for line in f:
            if not recording:
                # Start recording at the line that contains the start marker
                if re.search(start_pattern, line) is not None:
                    recording = True
                    output_section.append(line.strip())
            else:
                # Stop at the end marker; break (not sys.exit()) so the print below still runs
                if re.search(stop_pattern, line) is not None:
                    break
                output_section.append(line.strip())

    print(" ".join(output_section))
As you can see in the link above, the PDF files have different sections. I need to get the information inside those sections. For example, one of the fields in my database is going to be "Nombre y apellido" (first and last name). It is contained between "apellidos:" and "1.2".
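To make the goal concrete, here is a minimal sketch of the kind of extraction I am after, assuming the text of the PDF is already available as a Python string. The helper name `extract_between` and the sample text are made up for illustration; they are not taken from the real file.

```python
import re

def extract_between(text, start, stop):
    """Return the text between the first `start` marker and the next `stop` marker."""
    # re.escape makes the markers match literally (so the "." in "1.2" is a real dot);
    # re.DOTALL lets the captured value span several lines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(stop)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces left over from the PDF layout.
    return " ".join(match.group(1).split())

# Hypothetical sample text, only shaped like the section described above.
sample = "1.1 Nombres y apellidos:\nJuan Pérez\nGonzález\n1.2 Cargo: Director\n"
name = extract_between(sample, "apellidos:", "1.2")
print(name)
```

The same helper could in principle be reused for the other fields by passing a different pair of markers, as long as each field sits between two fixed labels.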
What should I do? Can I work directly with the PDF format, or should I work with txt files? And then, what should I use to get the information? (Python 3.XX; Anaconda)
Thanks