I am trying to iterate through a number of PDF files within a folder on my desktop. My goal is to read the text from each of these PDFs (they are all only one page long) and place each distinct PDF's text into a new row within one dataframe.
I have tried looping through the folder, and the loop does give me the text from every PDF in it (I created a folder with two "test" PDFs to check that the code works), but it fails to combine the text into a single dataframe. I would like the code to produce one dataframe with a new row for each PDF's text so that I can export it to a CSV afterward. Instead, I get two separate dataframes that, once I export to a CSV, do not transfer their text into the CSV file. I believe my code overwrites the dataframe on every iteration, so only the last one created survives as the object called "df".

Any help would be greatly appreciated. I hope this question is clear enough; I have seen related threads but have not been able to find one that solves this exact issue.
import os

import fitz  # PyMuPDF
import pandas as pd

rootdir = 'directory file path'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        doc = fitz.open(file)
        page = doc[0]  # every PDF is a single page
        text = page.getText("text")
        text_list = []  # create list to store text in
        text_list.append(text)  # append the text to the list
        df = pd.DataFrame(text_list)  # create a df from the list
        df.columns = ['text']
        doc.close()
        print(df)
Output is below:
               text
0  Dummy PDF file\n
                                               text
0  \n \n \n \n \n \nThis is a test PDF document....
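For reference, here is a minimal sketch of the structure I am trying to get to, assuming the cause is what I suspect: text_list and df are recreated on every pass through the loop. The sketch creates the list once, appends one entry per PDF, and builds the dataframe a single time after the loop. The function names, the extra "file" column, and the .pdf filter are my own assumptions for illustration, and get_text is the newer PyMuPDF spelling of getText. It also joins subdir with the file name, since os.walk yields bare file names.

```python
import os


def first_page_text(path):
    """Return the text of the first page of the PDF at `path`."""
    # PyMuPDF is imported here so the folder walk below can be
    # exercised without it (for example with a stand-in extractor).
    import fitz

    doc = fitz.open(path)
    try:
        # Called getText in older PyMuPDF releases.
        return doc[0].get_text("text")
    finally:
        doc.close()


def collect_pdf_rows(rootdir, extract_text=first_page_text):
    """Walk `rootdir` and return one {'file', 'text'} dict per PDF found."""
    rows = []  # created once, before the loops, so it is never reset
    for subdir, _dirs, files in os.walk(rootdir):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue  # skip anything that is not a PDF
            path = os.path.join(subdir, name)  # os.walk gives bare names
            rows.append({"file": name, "text": extract_text(path)})
    return rows


def pdf_folder_to_dataframe(rootdir):
    """Build a single dataframe with one row per PDF under `rootdir`."""
    import pandas as pd

    # The dataframe is built once, after every PDF has been read.
    return pd.DataFrame(collect_pdf_rows(rootdir), columns=["file", "text"])
```

With helpers like these, something along the lines of pdf_folder_to_dataframe(rootdir).to_csv('pdf_text.csv', index=False) should write one row per PDF; the output file name is only a placeholder.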