I am trying to iterate through a number of PDF files within a folder on my desktop. My goal is to read the text from each of these PDFs (they are all only one page long) and place each distinct PDF's text into a new row within one dataframe.

Looping through the folder works in the sense that I get text output from every PDF in it (I created a folder with two "test" PDFs to check the code), but the text is not combined into one dataframe. I would like the code to produce a single dataframe with one row per PDF's text, so that I can export it to a CSV afterward. Instead I get two separate dataframes, and when I export to CSV their text does not carry over into the file. I believe my code overwrites every dataframe except the last one created, leaving only one object called "df".

Any help would be greatly appreciated. I hope this question is clear enough; I have seen related threads but could not find one that solves this exact issue.

import os

import fitz  # PyMuPDF
import pandas as pd

rootdir = 'directory file path'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        doc = fitz.open(file)
        page = doc[0]
        text = page.getText("text")

        text_list = []                    # create list to store text in

        text_list.append(text)            # append the text to the list
        df = pd.DataFrame(text_list)      # create a df from the list
        df.columns = ['text']

        doc.close()

        print(df)

Output is below:

         text
0  Dummy PDF file\n
                                                text
0   \n \n \n \n \n \nThis is a test PDF document....
ComplicatedPhenomenon
Sam Cannon
  • Create `text_list` before the `for`-loop to get all values in one list, and create the `DataFrame` after the `for`-loop to convert this one list into one dataframe – furas Aug 01 '19 at 02:55
  • Wow, that solved it. thanks furas, you just saved me big time!!! – Sam Cannon Aug 01 '19 at 03:04
  • Possible duplicate of [Import multiple csv files into pandas and concatenate into one DataFrame](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) – wwii Aug 01 '19 at 03:11

1 Answer

Although the question is quite old, let me answer it in case it helps someone with a similar issue.

I believe overwrites every dataframe except for the last one created

That happens because both the list (text_list) and the dataframe (df) are recreated on every iteration, so each pass discards the result of the previous one. For example:

  • iteration 2 replaces the df built in iteration 1
  • iteration 3 replaces the df built in iteration 2
  • iteration 4 replaces the df built in iteration 3

and so on, until df contains only the result of the last iteration. Here is your code with that fixed:

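import os

import fitz  # PyMuPDF
import pandas as pd

rootdir = 'directory file path'
text_list = []  # create the list once, before the loop

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # os.walk yields bare file names, so join them with their folder
        doc = fitz.open(os.path.join(subdir, file))
        page = doc[0]
        text = page.getText("text")  # named get_text in newer PyMuPDF releases

        text_list.append(text)  # append inside the inner loop, once per PDF
        doc.close()

# create the df from the full list and name the column in one step
df = pd.DataFrame(text_list, columns=['text'])
print(df)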
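
The overwrite described above can be reproduced without any PDFs or pandas. The following is only a minimal sketch: the function names and the sample strings are made up for illustration and stand in for the text extracted from each file.

```python
def rebuilt_each_iteration(texts):
    """Mimics the question's loop: the list is recreated inside the loop."""
    for text in texts:
        text_list = []          # a fresh, empty list on every pass
        text_list.append(text)  # so it only ever holds the current text
    return text_list


def built_once(texts):
    """Mimics the fix: one list created before the loop, filled inside it."""
    text_list = []
    for text in texts:
        text_list.append(text)
    return text_list


# Stand-ins for the text extracted from each PDF (made-up values).
sample_texts = ["text of first PDF", "text of second PDF"]

print(rebuilt_each_iteration(sample_texts))  # ['text of second PDF']
print(built_once(sample_texts))  # ['text of first PDF', 'text of second PDF']
```

Passing a list built the second way to pd.DataFrame(..., columns=['text']) should give one row per PDF, which can then be written out with df.to_csv.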
Hanif Han