
I have created a folder with 158 PDF files and want to extract the data from each one. Here is what I have done so far.

Importing modules

from itertools import chain
import pandas as pd
from tabula import read_pdf  # tabula-py

Reading the data files

data_A = read_pdf('D:\\Code\\Scraping\\DMKQ\\A.pdf', pages='all', encoding='latin1')
data_B = read_pdf('D:\\Code\\Scraping\\DMKQ\\B.pdf', pages='all', encoding='latin1')

# Build one DataFrame per file and print its row count.
data_A_c = chain(*(page.values for page in data_A))
headers = data_A[0].columns
df_A = pd.DataFrame(data_A_c, columns=headers)
df_A.set_index('Name', inplace=True)
print(len(df_A.index))

data_B_c = chain(*(page.values for page in data_B))
headers = data_B[0].columns
df_B = pd.DataFrame(data_B_c, columns=headers)
df_B.set_index('Name', inplace=True)
print(len(df_B.index))

At the moment I have to copy this block and change the file name for each new file, which is time-consuming and practically infeasible with 158 files in the folder. Does anybody know how to execute this entire process more efficiently?
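One way to remove the duplication (a sketch, not the original poster's code) is to factor the repeated block into a function. The helper name `pdf_to_df` is my own, and it assumes every file parses the same way and has a 'Name' column:

from itertools import chain
import pandas as pd
from tabula import read_pdf

def pdf_to_df(path):
    # read_pdf with pages='all' returns a list of DataFrames, one per table/page.
    tables = read_pdf(path, pages='all', encoding='latin1')
    # Stack the rows of every per-page table under the first table's headers.
    rows = chain(*(t.values for t in tables))
    df = pd.DataFrame(rows, columns=tables[0].columns)
    df.set_index('Name', inplace=True)
    return df

With that in place, df_A = pdf_to_df('D:\\Code\\Scraping\\DMKQ\\A.pdf') replaces each copied block.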

If all of the files of interest are in 1 folder, you should be using `os.listdir()` to create a list of files and iterate over it. See this: https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory – AirSquid Nov 20 '22 at 16:16
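Building on that comment, here is a minimal sketch of the full loop with os.listdir(), reusing the pdf_to_df helper sketched above; the folder path and the dict-of-DataFrames layout are assumptions:

import os

folder = 'D:\\Code\\Scraping\\DMKQ'

frames = {}  # maps each file's base name ('A', 'B', ...) to its DataFrame
for name in os.listdir(folder):
    if not name.lower().endswith('.pdf'):
        continue  # skip anything that is not a PDF
    df = pdf_to_df(os.path.join(folder, name))
    frames[os.path.splitext(name)[0]] = df
    print(name, len(df.index))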

0 Answers