I'm preparing data for a machine learning process. I have a Word document (.docx) from which I would like to extract the headlines together with their corresponding texts and save them in an Excel spreadsheet. It is important that each text ends up in the correct column for its headline, that the order of the document is preserved, and that the data is not all thrown together in one table.
So far I have managed to read out the headlines as well as the texts, but I don't know how to write both of them into an Excel table.
import os
from docx import Document
import re
import xlsxwriter
import pandas as pd
from docx.shared import Inches
document = Document("/home/XXX/XXX/XXX/XXX.docx")
heading1 = []
heading2 = []
text1 = []
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        heading1.append(paragraph.text)
    elif paragraph.style.name == 'Heading 2':
        heading2.append(paragraph.text)
    elif paragraph.style.name == "Normal":
        text1.append(paragraph.text)
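What I have in mind is roughly the following. This is only a sketch under my own assumptions, not a working solution: each "Heading 2" section becomes one row, with the enclosing "Heading 1" repeated in the first column and the "Normal" paragraphs joined into one text cell. The names rows_from_paragraphs and write_rows_to_excel, the column names, and the output file name are placeholders I made up.

```python
def rows_from_paragraphs(paragraphs):
    """Group (style_name, text) pairs into one row per heading section."""
    rows = []
    h1 = ""          # most recent "Heading 1" text
    current = None   # row that "Normal" paragraphs are currently added to
    for style_name, text in paragraphs:
        if style_name == "Heading 1":
            h1 = text
            current = None
        elif style_name == "Heading 2":
            current = {"heading1": h1, "heading2": text, "text": []}
            rows.append(current)
        elif style_name == "Normal" and text.strip():
            if current is None:
                # Body text directly under a Heading 1, before any Heading 2
                current = {"heading1": h1, "heading2": "", "text": []}
                rows.append(current)
            current["text"].append(text)
    # Join the collected paragraphs so each row has a single text cell
    for row in rows:
        row["text"] = "\n".join(row["text"])
    return rows


def write_rows_to_excel(rows, path):
    """Write the rows to an .xlsx file with one column per field."""
    # Imported here so the grouping function above works without pandas
    import pandas as pd

    frame = pd.DataFrame(rows, columns=["heading1", "heading2", "text"])
    frame.to_excel(path, index=False, engine="xlsxwriter")
```

With the document loaded as above, this would be called as rows = rows_from_paragraphs((p.style.name, p.text) for p in document.paragraphs) followed by write_rows_to_excel(rows, "output.xlsx"). Keeping the heading and its text in the same row is what should preserve the order and the heading-to-text association, which three separate lists cannot do.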
I hope my problem is clear.