
I need to build a database from several pieces of data, most of which are contained in PDF files. The PDF files all share the same layout; only the data changes. (For example, one of the files I have to work with: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e)

I have tried to extract the data with PyPDF, tabula, and pdfminer (I even tried textract, but it didn't work through Anaconda), among other tools, but I didn't get what I wanted.

Then I tried converting the PDF files to txt files and mining those, but didn't get anything. I also tried regex, but I don't really understand how to use it, although the code doesn't show errors when running:

import re

# Raw string so the backslashes in the Windows path are taken literally
your_file = r"D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
start_pattern = r'apellidos:'
stop_pattern = r'1\.2'  # escape the dot so it matches a literal "1.2"
output_section = []
recording = False

with open(your_file) as f:
    for line in f:
        if not recording:
            if re.search(start_pattern, line):
                recording = True
                output_section.append(line.strip())
        elif re.search(stop_pattern, line):
            # Stop collecting, but keep running so the result gets printed
            break
        else:
            output_section.append(line.strip())

print(" ".join(output_section))

As you can see in the link above, the PDF files have different sections, and I need to get the information inside those sections. For example, one of the fields in my database is going to be "Nombre y apellido" (name and last name). It is contained between "apellidos:" and "1.2".
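To make the goal concrete, here is a minimal sketch of the kind of extraction I mean, assuming the converted text has already been read into a single string. The helper name `extract_between` and the sample text are hypothetical and only for illustration; the real layout of the converted files may differ.

```python
import re


def extract_between(text, start, stop):
    """Return the text between the literal markers start and stop, or None."""
    # re.escape makes "1.2" match literally; re.DOTALL lets "." span line breaks.
    pattern = re.escape(start) + r"(.*?)" + re.escape(stop)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(match.group(1).split())


# Hypothetical sample, not taken from the real files.
sample = "1.1 Nombre y apellidos:\nJuan\nPérez Soto\n1.2 Cargo: Director\n"
print(extract_between(sample, "apellidos:", "1.2"))  # prints: Juan Pérez Soto
```

Searching the whole text at once, rather than line by line, would also cover the case where both markers end up on the same line after conversion.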

What should I do? Can I work directly from the PDF format, or should I work with txt files? And then, what should I use to get the information? (Python 3.XX; Anaconda)

Thanks

  • https://automatetheboringstuff.com/chapter13/ – Klemen Tusar Aug 26 '19 at 14:00
  • Hi, thanks for the answer. I extracted full pages, but what I can't do is get specific parts of those pages by specific patterns, as I wrote. For example, extracting the name and last name. – noesunfelioe Aug 26 '19 at 14:22
  • I am not sure if this link is a duplicate or just something that will get you a very long step of the way along, but try https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text to get the pdf to text. This will avoid all the "noise" in the pdf like drawing a rectangle and coloring the background of it where the name goes. Once the noise is gone, you'll find data mining a lot easier. – Scott Mermelstein Aug 26 '19 at 16:59
  • Thanks Scott. I've actually done that. I'm stuck trying to get just a part of the text, I mean, the text between two words or numbers. – noesunfelioe Aug 26 '19 at 18:11
  • Could you [edit] a little bit of your text file into your question? Maybe 2 lines before the significant lines, up to 2 lines after? – Scott Mermelstein Aug 26 '19 at 19:45
  • I'm not sure. I mean, the idea is to automate the whole process, because I have to repeat this with 600 different PDFs. – noesunfelioe Aug 26 '19 at 21:46
