I am reading sections and paragraphs from an input docx file and copying their content into a particular section of another docx file. The content contains images, tables, and bullet points interleaved with the text. However, I only get the text, not the images, tables, and bullet points that appear in between.
The Tika module can read the whole content, but the entire docx comes back as a single string, so I can't iterate over the sections, and I also can't edit the output docx file (to paste the copied content into it).
I tried python-docx, but it only reads the text and doesn't identify the images and tables that sit inside a paragraph or between runs of text. python-docx does list all the images and tables in the whole document, but not per paragraph.
I tried unzipping the Word file to XML, but the XML keeps the images in a separate folder. Also, my code does not identify the bullets.
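For reference, here is a minimal sketch of how the unzipped XML could be walked so that paragraphs, list items, tables, and images stay in document order. It uses only the standard library and is an illustration under stated assumptions, not a complete solution: `iter_body_blocks` and the `kind` labels are hypothetical names, image targets in `word/_rels/document.xml.rels` are assumed to be relative paths such as `media/image1.png`, and a paragraph is treated as a list item only when it carries `w:numPr` directly (list formatting inherited from a style is not detected). It only reads the input; pasting into the output docx is not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML, DrawingML and relationship namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def iter_body_blocks(docx_path):
    """Yield (kind, text, image_paths) for each top-level block, in document order.

    kind is 'paragraph', 'list_item' or 'table' (hypothetical labels).
    """
    with zipfile.ZipFile(docx_path) as zf:
        # Map relationship ids (rId1, ...) to part names like 'word/media/image1.png'.
        # Assumption: targets are relative to the 'word/' folder.
        rels = {}
        rels_name = "word/_rels/document.xml.rels"
        if rels_name in zf.namelist():
            rels_root = ET.fromstring(zf.read(rels_name))
            for rel in rels_root.iter(f"{{{PKG_REL}}}Relationship"):
                rels[rel.get("Id")] = "word/" + rel.get("Target")
        body = ET.fromstring(zf.read("word/document.xml")).find(f"{{{W}}}body")

    for block in body:
        if block.tag == f"{{{W}}}tbl":
            kind = "table"
        elif block.tag == f"{{{W}}}p":
            # A paragraph with w:numPr belongs to a bulleted or numbered list.
            is_list = block.find(f"{{{W}}}pPr/{{{W}}}numPr") is not None
            kind = "list_item" if is_list else "paragraph"
        else:
            continue  # e.g. w:sectPr, which holds section properties, not content
        # For a table this concatenates the text of all cells without separators.
        text = "".join(t.text or "" for t in block.iter(f"{{{W}}}t"))
        # Embedded pictures reference their media file through an r:embed id.
        images = [rels.get(blip.get(f"{{{R}}}embed"))
                  for blip in block.iter(f"{{{A}}}blip")]
        yield kind, text, images
```

Calling `list(iter_body_blocks("input.docx"))` (with `input.docx` standing in for the real file) would give one tuple per block, so an image or table can be tied to the position where it occurs instead of being collected for the whole document.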
This is my Tika attempt:

    from tika import unpack


    def tika_extract_data(input_file, output_file):
        # unpack.from_file returns a dict; the extracted text is under 'content'
        parsed = unpack.from_file(input_file)
        content = parsed.get('content') or ''
        with open(output_file, 'w', encoding='utf-8') as f:
            # Tika flattens the document into one string, so only text is available here
            for line in content.split('\n'):
                print(line)
                f.write(line + '\n')
I expected the output file to have all of its sections replaced with the copied content of the input sections (images, tables, SmartArt, and formatting).
Instead, the output file has only the text data.