
I have a multi-page DOCX document and need to write code that splits it by page, exactly as the pages appear when the document is opened in Word (the same way that saving as PDF divides the pages). The output must stay in DOCX format and match the pages of the original document, so converting PDF files back to DOCX is not a good enough solution. Each page should finally be saved as a separate DOCX file, so that if I need page "k" I can open document "k", which contains exactly page "k" of the original document. I mostly work in Python, but the problem has been so challenging that I am willing to hear any idea, including in other languages.

As mentioned, I tried converting to PDF, splitting, and converting back to DOCX, but the conversions altered the content of many pages, because these documents contain, in addition to plain text, many tables in a variety of designs.

I also found VBA code that was supposed to solve the problem, but in practice its splitting was not consistent enough: in some documents it created blank pages, and it occasionally split a single page into several pages.

Gilad_Tz
  • Please do not tag-spam. "Choose your weapon". – Fildor May 05 '21 at 08:04
  • Does this help: [How to write separate DOCX files by page from one DOCX file?](https://stackoverflow.com/questions/59993669/how-to-write-separate-docx-files-by-page-from-one-docx-file) ? – Fildor May 05 '21 at 08:09
  • Thanks so much for the comment! First of all, sorry for the tag spam. I did it because I'm willing to change my "weapon" depending on which solution is correct, so I tagged several options that seemed reasonable to me, but I have now deleted them. As for the link you added, it doesn't solve my problem, because in my case each page has a different structure, and in any case converting to PDF is not an option, nor is using a non-free API. Thank you!! – Gilad_Tz May 05 '21 at 08:44
  • @macropod The problem with these documents is that they are ill-structured, in the sense that every page has a different number of paragraphs and tables (some have none), and only some pages contain a header/footer. This is why I gave the conversion to PDF as an example: it generally performs well despite the lack of consistency inside the document. What I need is to split the visually separate pages, not necessarily by sections, paragraphs, etc. The bigger picture is that I need to extract data from specific pages, and I don't know how to get at them without splitting. – Gilad_Tz May 05 '21 at 09:21
  • _"The bigger picture is that I need to extract data from specific pages"_ - Run, don't walk. That project is doomed. – Fildor May 05 '21 at 11:20
  • @Fildor LOL Thanks for the encouragement, mate! I managed to perform the data extraction; the problem is that I only know how to extract the data from the entire document and then manually find the data of the relevant page. If I had the ability to save each page as a separate file, I could extract the data from that specific file. Is this such an unsolvable challenge? – Gilad_Tz May 05 '21 at 12:25
  • Ok, crazy Idea ... but hear me out: How about _printing_ the document? That _should_ give you repeatable results on what goes to what page... and you don't _have_ to print to _paper_ ... – Fildor May 05 '21 at 12:28
  • Outside that ... it's really harder than it sounds. If you don't have specific markers, and have to go by a "measure" that is basically not native to a word document ... (by which I mean: apart from explicit page breaks, pages will only be determined by print ... ). Yes, so my best idea is "Print page N as Word 'name-N.doc'" -- and good luck! – Fildor May 05 '21 at 12:32
  • Quite apart from anything else, because Word uses data from the current printer driver to optimise the page layout, what will fit on one page on your system may occupy less than that - or more - on another system. So page splits of the kind you want are liable to give unpredictable results. – macropod May 05 '21 at 13:05
  • If I am not mistaken, printing converts the document to pdf, or at least saving the printable file converts it to pdf, which leads me to the problem that extracting data from pdf is much less robust than from Docx. However, the very fact that the conversion to pdf succeeds so accurately shows that there is a possibility of an accurate split of the pages, just as Word presents them. The question is how ...? – Gilad_Tz May 05 '21 at 13:42
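
One possible starting point for locating page boundaries without leaving DOCX: when Word saves a file, it normally records where its last layout broke pages as `w:lastRenderedPageBreak` elements in `word/document.xml`. The following is a minimal Python sketch, not a complete splitter, that uses only the standard library to report which pages each top-level paragraph or table spans. The function name `block_page_ranges` is hypothetical, and the sketch assumes the file was last saved by Word itself; as noted in the comments above, the layout can differ between systems, and the markers may be missing or stale in files produced by other tools.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def block_page_ranges(docx_file):
    """Return (kind, first_page, last_page) for each top-level body block.

    docx_file may be a path or a binary file-like object.
    Pages are counted from Word's cached w:lastRenderedPageBreak markers,
    so the numbers are only as good as the last layout Word saved.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    if body is None:
        return []

    page = 1
    ranges = []
    for block in body:
        kind = block.tag.replace(W, "")
        if kind not in ("p", "tbl"):
            continue  # skip sectPr and any other non-paragraph, non-table child
        first_page = page
        seen_text = False
        for el in block.iter():  # document order, including nested table cells
            if el.tag == W + "lastRenderedPageBreak":
                page += 1
                if not seen_text:
                    # Marker precedes any text, so the block starts on the new page.
                    first_page = page
            elif el.tag == W + "t" and el.text:
                seen_text = True
        ranges.append((kind, first_page, page))
    return ranges
```

A block whose first and last page differ straddles a page boundary, which is where a real splitter would have to cut inside a paragraph or table. Writing each page out as a valid standalone DOCX (styles, numbering, headers and footers, section properties) is a separate and larger step that this sketch does not attempt.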

0 Answers