How to put all docx data into separate DataFrame columns in Python

Question

I couldn't find anything on Stack Overflow for this question and I have no idea how to work it out, so please bear with me.

Below is my code:

import os
from docx import Document

v_doc = []

for root, dirs, files in os.walk(paths):
    for t in files:
        if t.endswith('.docx'):
            # join with root so files in subfolders open correctly
            v_doc.append(Document(os.path.join(root, t)))

            # Say there are 3 docx files that each contain simple sentences. How do I put
            # those sentences into a separate dataframe column for each docx? I have many docx files (n of them).

example docx:

docx1 contains:

Hello guys how are you all, hope you guys doing good.

docx2 contains:

I dont know what to write here

docx3 contains:

We are strong together ! do we ?

expected output:

dataframe:
column1                                                 column2                           column3
#Hello guys how are you all, hope you guys doing good.  #I don't know what to write here  #We are strong together ! do we ?

I hope to get some responses. Thank you in advance.

This is not a minimal reproducible code snippet--try to make it reproducible — duhaime, Jul 11 '21 at 18:43

score 1 · Accepted Answer · answered Jul 11 '21 at 19:33


Gotchya:

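import os
import docx

dataframe = {}

def get_files(extension, location):
    """Return paths of all files under location that end with extension."""
    v_doc = []

    for root, dirs, files in os.walk(location):
        for t in files:
            if t.endswith(extension):
                # keep the folder so files in subdirectories can be opened
                v_doc.append(os.path.join(root, t))
    return v_doc

file_list = get_files('.docx', '.')
for index, file in enumerate(file_list, start=1):
    doc = docx.Document(file)
    column_label = f'column{index}'
    # first paragraph only; it is an empty string if that paragraph is blank
    data_content = doc.paragraphs[0].text
    # add a key per file instead of rebinding the whole dict
    dataframe[column_label] = data_content

print(dataframe)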
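
When a document's first paragraph is blank, `doc.paragraphs[0].text` is an empty string. Below is a minimal sketch for collecting every paragraph instead, assuming python-docx is installed. `join_paragraphs` and `read_docx_text` are hypothetical helper names, and the `SimpleNamespace` objects only stand in for real paragraphs so the joining step can be tried without a file. Note that `join` is a string method, so the separator comes first:

```python
from types import SimpleNamespace


def join_paragraphs(paragraphs, separator="\n\n"):
    """Join the .text of each paragraph-like object, skipping blank ones."""
    return separator.join(p.text for p in paragraphs if p.text.strip())


def read_docx_text(path):
    """Return all non-blank paragraph text of one .docx file as a single string."""
    # Imported here so join_paragraphs can be tried without python-docx installed.
    from docx import Document
    return join_paragraphs(Document(path).paragraphs)


# Stand-ins for doc.paragraphs: objects that expose a .text attribute.
fake_paragraphs = [
    SimpleNamespace(text=""),
    SimpleNamespace(text="Hello guys how are you all,"),
    SimpleNamespace(text="hope you guys doing good."),
]
joined = join_paragraphs(fake_paragraphs)
print(joined)
```

In the loop above, `data_content = read_docx_text(file)` would then store the whole document rather than only its first paragraph.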

answered Jul 11 '21 at 19:33

Bilal Qandeel


{'column1': 'contents of example1.docx', 'column2': 'contents of example2.docx', 'column3': 'contents of example3.docx'} – Bilal Qandeel Jul 11 '21 at 20:02
`doc.paragraphs[0].text` shows nothing, but `for x in data_content: print(x.text)` prints the content. – Titan Jul 11 '21 at 20:04
It is supposed to grab the `title` only, i.e. the very first paragraph. If that paragraph is left blank, the result will be blank too. – Bilal Qandeel Jul 11 '21 at 20:06
Oh, I see. But is it possible to grab everything in docx1 and docx2 and put it in column1 and column2 of the dataframe? – Titan Jul 11 '21 at 20:08
`doc.paragraphs[0].text` shows nothing, dude, but iterating over `doc.paragraphs` and reading each `.text` shows the content. – Titan Jul 11 '21 at 20:10
if you need all the contents inside of the `docx` , then join the `paragraphs` all together with two new lines: (one to start a new line and another to start a new paragraph) that can be achieved using `join` like this `doc.paragraphs.join('\n\n')` – Bilal Qandeel Jul 11 '21 at 20:12
AttributeError: 'Paragraph' object has no attribute 'join'. ... – Titan Jul 11 '21 at 20:17
can you please update working code in your main code ? thank you for trying to help, really appreciate. – Titan Jul 11 '21 at 20:19
Working awesome!!!! I changed the code a little bit. MANY MANY THANKS! – Titan Jul 11 '21 at 20:28
Hey, I tried `dataframe = pd.DataFrame(data_content, columns=[column_label])` and the dataframe only shows column2, not column1. Can you help? – Titan Jul 11 '21 at 21:29
Of course, it did not. You have just squeezed all the `data_content` into a single column named by the value of `column_label`. `dataframe` is already of the data type `dataframe`. I see no value recasting it using `pd.DataFrame`. i.e. use `dataframe['some_nice_column']` as it is wherever needed. – Bilal Qandeel Jul 11 '21 at 21:42
Yeah, I understand, but `dataframe.to_excel` is a very easy one. Can you please tell me how to export that `{column_label: data_content}` to excel.xlsx? – Titan Jul 11 '21 at 21:51
Do I get "Best Answer" XD XD? 1. `df = dataframe['some_nice_column']` and 2. `df.to_excel("output.xlsx")` – Bilal Qandeel Jul 11 '21 at 22:08
Hey buddy, I understand, but there should be n columns, because in my case there are not just two docx files but n of them. – Titan Jul 11 '21 at 22:16
Hey, I marked your response as the best answer, no doubt about that!!! – Titan Jul 12 '21 at 06:35
Can you please help with this query? I will give 50+ reputation if it is answered. https://stackoverflow.com/questions/68413792/how-to-sort-dataframe2-according-to-dataframe1-with-fuzzywuzzy – Titan Jul 17 '21 at 15:02
You got it... https://stackoverflow.com/questions/68413792/how-to-sort-dataframe2-according-to-dataframe1-with-fuzzywuzzy/68425166 – Bilal Qandeel Jul 18 '21 at 00:27
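
On the questions in the comments about getting n columns and exporting them: `dataframe` in the answer is a plain dict, so it has no `to_excel` method of its own. Below is a minimal sketch of one way to turn it into a one-row pandas DataFrame, assuming pandas is installed. The `contents` dict is a stand-in for whatever the loop produced, filled here with the example sentences from the question:

```python
import pandas as pd

# Stand-in for the dict built by the loop: one key per docx file.
contents = {
    "column1": "Hello guys how are you all, hope you guys doing good.",
    "column2": "I dont know what to write here",
    "column3": "We are strong together ! do we ?",
}

# Wrapping the dict in a list yields one row with one column per key.
# Passing a dict of plain strings directly raises a ValueError, because
# pandas cannot infer an index from scalar values alone.
df = pd.DataFrame([contents])

print(df.shape)
print(list(df.columns))
```

Because the columns come from the dict keys, this works for any number of files. `df.to_excel("output.xlsx", index=False)` would then write the row to a workbook, provided an Excel writer engine such as openpyxl is installed.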
