How to put docx content in dataframe columns?

Question

Below is my code:

if t.endswith('.docx'):
    def get_files(extension, location):
        v_doc = []
        for root, dirs, files in os.walk(location):
            for t in files:
                if t.endswith(extension):
                    v_doc.append(t)
        return v_doc

    file_list = get_files('.docx', paths)
    #print(file_list)
    index = 0
    for file in file_list:
        index += 1
        doc = Document(file)
        #print(doc)
        column_label = f'column{index}'
        data_content = doc.paragraphs
        final = []
        for f in data_content:
            final.append(f.text)
            new = [x for x in final if x]
            #j = {column_label: new}
            #print(j)
            df_last = pd.DataFrame(new, columns=[column_label])
            df_last.to_excel('output_dummy.xlsx')

But I get the following problem: only column2 appears in the output.

column2:
#hello how are you guys?
#i hope you are all doing fine

Expected dataframe output:

column1:                                                 column2:
#This column is getting replaced by column 2             #hello how are you guys?

#some random dummy text                                  #i hope you are all doing fine

docx1 contains: #This column is getting replaced by column 2 #some random dummy text

docx2 contains: #hello how are you guys? #i hope you are all doing fine

I know it's a silly question. Where am I making the mistake?

Please provide an example docx and full [MRE](https://stackoverflow.com/help/minimal-reproducible-example). Also, what is the library you use to open *.docx? — sophros, Jul 13 '21 at 08:38
Hey, thanks for replying. I have already solved this question; it's old. But can you please check my new question here? https://stackoverflow.com/questions/68413792/how-to-sort-dataframe2-according-to-dataframe1-with-fuzzywuzzy — Titan, Jul 17 '21 at 05:27

score 0 · Answer 1 · answered Jul 17 '21 at 05:35

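I found the answer.

The column label has to be unique across all file types. Where the .docx loop used f'column{index}', use f'column{index+index2}' instead, so that its numbering continues after the columns already created for the Excel files (the code below uses the prefix seller instead of column).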

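import pandas as pd
from docx import Document  # assumption: the python-docx package

# file_list2 (Excel files) and file_list (.docx files) are assumed to be
# built beforehand, e.g. with get_files() from the question.
g = []        # raw cell values from all Excel files
final = []    # one {column_label: value} record per value

# index2 counts the Excel files, just as index counts the .docx files.
index2 = 0
for file2 in file_list2:
    file2 = 'datas/' + file2
    index2 += 1
    column_label2 = f'seller{index2}'
    df = pd.read_excel(file2, header=None, index_col=False)
    for l in df.values:
        for s in l:
            g.append(s)

# Drop empty cells (NaN). This runs after the loop, so every value
# collected in g is stored under the label of the last Excel file.
t = [incom for incom in g if str(incom) != 'nan']
for s in t:
    final.append({column_label2: s})

index = 0
for file in file_list:
    file = 'datas/' + file
    index += 1
    doc = Document(file)
    # Offset by index2 so .docx columns do not reuse an Excel column label.
    column_label = f'seller{index+index2}'
    # Text inside tables, skipping unwanted filler values.
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                new_list = [p.text for p in cell.paragraphs
                            if p.text not in ['5', '3', '0.1%', '1%', '1', 'Bill', 'Number']]
                for s in new_list:
                    final.append({column_label: s})

    # Text of the body paragraphs, with a similar filter.
    y = [d.text for d in doc.paragraphs
         if d.text not in ['5', '3', '0.1%', '1%', '1', 'Number']]
    for k in y:
        final.append({column_label: k})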
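
The code above stops once final has been filled. One possible way to turn such a list of {column_label: value} records into a dataframe with one column per file is sketched below. This is only an illustration, not necessarily what the original code did next: the helper name records_to_columns and the sample records are hypothetical, and the sample texts are copied from the question rather than read from real files.

```python
from collections import defaultdict

import pandas as pd


def records_to_columns(records):
    """Turn [{label: value}, ...] into a DataFrame with one column per label."""
    grouped = defaultdict(list)
    for record in records:
        for label, value in record.items():
            grouped[label].append(value)
    # Wrapping each list in a Series lets the columns have different
    # lengths; shorter columns are padded with NaN.
    return pd.DataFrame(
        {label: pd.Series(values) for label, values in grouped.items()}
    )


# Hypothetical records standing in for the paragraphs of two .docx files.
final = [
    {"column1": "This column is getting replaced by column 2"},
    {"column1": "some random dummy text"},
    {"column2": "hello how are you guys?"},
    {"column2": "i hope you are all doing fine"},
]

df_last = records_to_columns(final)
print(df_last)

# Write the file once, after every input has been read, so that a later
# file cannot overwrite the columns of an earlier one:
# df_last.to_excel("output_dummy.xlsx", index=False)
```

Building the dataframe and calling to_excel a single time, outside the loops, avoids the situation in the question, where every iteration rewrote output_dummy.xlsx with only the current column.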
