Problem

I want to extract tables from several PDFs. Extracting the data works; only writing it to a CSV file does not give the layout I expect.

What I get: [image: as it should not be]

How I want it to look: [image: how it should look]

I am importing pdfminer, os and pandas.

My Code

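import os

import pandas as pd
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

path='My_Path'
df_results = pd.DataFrame()
for file_name in os.listdir(path): #Loop on Files
    print(file_name)
    fp = open(os.path.join(path, file_name), 'rb')  # join adds the path separator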
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pages = PDFPage.get_pages(fp)
    
    for page in pages:
        print('Processing next page...')
        interpreter.process_page(page)
        layout = device.get_result()
       

        for lobj in layout:
            if isinstance(lobj, LTTextBox):               
                x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
              
                #print('At %r is text: %s' % ((x, y), text))
                #data= pd.Series(text)
                
                if x==50.520000749999994 and y==200.30424779999996: #x and y from console from print from line 39
                    collected_data = [text]
                    data_list = collected_data
                    #data = pd.Series(data_list)
                    print(data_list)
                    data= pd.DataFrame([data_list], columns=list('c'), )
                    df_results = df_results.append(data,ignore_index=False)                
                if x==405.599991 and y==187.82423730000002: #x and y from console from print from line 39
                    collected_data = [text]
                    data_list = collected_data
                    #data = pd.Series(data_list)
                    print(data_list)
                    data= pd.DataFrame([data_list], columns=list('d'), )
                    df_results = df_results.append(data,ignore_index=False)                
                if x==562.4399872500001 and y==187.82423730000002: #x and y from console from print from line 39
                    collected_data = [text]
                    data_list = collected_data
                    #data = pd.Series(data_list)
                    print(data_list)
                    data= pd.DataFrame([data_list], columns=list('f'), )
                    df_results = df_results.append(data,ignore_index=False)
                      
                    #print(collected_data)
print(df_results)
df_results.to_csv('coordinates_data.csv', index = False, sep=';', )

1 Answer

You don't say what print(df_results) shows at the end, but if you look at it you will find a data frame with many rows, each with a value in only one column. That is why your output is not structured as you want. The problem has nothing to do with CSV formatting and everything to do with getting the correct contents into the pandas data frame in the first place. Once the data frame has the correct contents, you should have no trouble saving it to CSV.

The PDF is structured so that what look like table columns are actually independent text boxes. You process them one at a time, so you get the data a column at a time rather than a row at a time, which is how pandas data frames are designed to be built.

When faced with similar problems, I have found it easier to assemble all the data in native Python data types (lists, dicts) first, then convert to a data frame at the end. In this example I would get your three lists from the three text boxes (each holds the values for one column) and then combine them. Either use zip() to iterate over all three lists in parallel (row by row in the table) or use a list comprehension. The aim is a list of rows with the correct structure, from which pandas can construct the data frame in a single operation.
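A minimal sketch of the zip() approach, assuming each text box holds one cell value per line. The three strings and the helper box_to_column are hypothetical stand-ins for the text values your loop matches by coordinates; the column names c, d and f are taken from your code.

```python
import pandas as pd

# Hypothetical stand-ins for the text of the three text boxes; in the
# question's loop these would be the `text` values matched by coordinates.
names_text = "Flatness A\nFlatness B\nDiameter C\n"
nominal_text = "0.05\n0.05\n35.5\n"
actual_text = "0.04\n0.07\n35.48\n"


def box_to_column(text):
    """Split one text box into a list of cell values, one per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Plain Python lists first: one list per table column.
columns = [box_to_column(t) for t in (names_text, nominal_text, actual_text)]

# zip() walks the column lists in parallel and yields one row at a time.
rows = list(zip(*columns))

# Build the data frame once, from complete rows.
df_results = pd.DataFrame(rows, columns=["c", "d", "f"])
print(df_results)
```

Note that zip() stops at the shortest list, so if one text box has fewer lines than the others the extra values are silently dropped; it is worth checking that the three lists have the same length first.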

P.S. It's easy to transpose a data frame if you end up with rows and columns swapped!
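As a small illustration of the transpose, assuming made-up data in which each inner list is one column of the table (the names by_column, name and value are purely illustrative):

```python
import pandas as pd

# Hypothetical data: each inner list is one *column* of the intended table.
by_column = [["a", "b", "c"], ["1", "2", "3"]]

# Passing the lists directly makes each of them a row: 2 rows x 3 columns.
df_swapped = pd.DataFrame(by_column)

# .T swaps rows and columns: 3 rows x 2 columns.
df_fixed = df_swapped.T
df_fixed.columns = ["name", "value"]
print(df_fixed)
```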

  • This is the output of `print(df_results)`: c ... f 0 Ausgewertetes Element\nEbenheit Ø35,5\nEbenhei... ... NaN 0 NaN ... NaN 0 NaN ... 7.87! \n1.64\n9.90! \n7.87! \n1.64\n9.75\n4.18... 0 Ausgewertetes Element\nEbenheit Ø35,5\nEbenhei... ... NaN ... – Jonathan RZ Dec 01 '21 at 10:42
  • You can use ``code`` tags to put pre-formatted text into comments - the above one doesn't make much sense as shown. There are also ways to persuade printing a data frame not to use ellipsis (`...`) and actually print the full output, which may be more informative: https://stackoverflow.com/questions/19124601/pretty-print-an-entire-pandas-series-dataframe – RichardAshAudacity Dec 06 '21 at 19:16
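
One way to print a data frame without the ellipsis, sketched here with pandas' option_context; the sample frame is made up, and the options shown are only one possible combination:

```python
import pandas as pd

# Hypothetical frame with one very long cell, which pandas would
# normally shorten with "..." when printing.
df = pd.DataFrame({"c": ["A" * 120], "f": [1.5]})

# Temporarily lift the display limits; they revert when the block exits.
with pd.option_context(
    "display.max_rows", None,
    "display.max_columns", None,
    "display.max_colwidth", None,
    "display.width", None,
):
    full_text = repr(df)
    print(full_text)
```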