
I have a large collection of DOCX documents, and I need to delete everything but the first page in each of them. From what I have read, python-docx does not support this because it has no notion of pages. One option I have considered is converting to PDF, deleting the pages, and converting back to DOCX, but I am concerned that this will degrade the formatting, and it will probably be slow for so many documents. What is my best option here?

Something like:

for page in pages[1:]:
    del page
kjhughes
AmanKP

2 Answers


You cannot delete particular pages from a DOCX file at the data level alone because you cannot even reliably reference pages at the data level.

You'll have to change your access model away from depending upon pagination, or hack a solution based on Word Automation with its licensing and server operation limitations. Moving to a non-page-based reference model such as one based on paragraphs or sections is your best option. Such models are more compatible with modern content management requirements across devices with widely varying display sizes anyway.
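To illustrate what is and is not available at the data level, here is a minimal sketch using only the Python standard library. It assumes the main document part lives at `word/document.xml` (the usual location) and only looks for explicit page breaks (`<w:br w:type="page"/>`) in top-level body paragraphs. The function name `first_explicit_page_break` is made up for this example. Page boundaries that Word computes at layout time are not stored in a form this can find, which is exactly the limitation described above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def first_explicit_page_break(docx):
    """Return the 0-based index of the first top-level body paragraph that
    contains an explicit page break, or None if there is none.

    `docx` may be a path or a binary file-like object.
    Assumption: the main part is stored at word/document.xml.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    body = root.find(f"{{{W}}}body")
    if body is None:
        return None

    # Only direct children of <w:body>; paragraphs nested in tables are skipped.
    for index, para in enumerate(body.findall(f"{{{W}}}p")):
        for br in para.iter(f"{{{W}}}br"):
            # A <w:br> without w:type="page" is an ordinary line break.
            if br.get(f"{{{W}}}type") == "page":
                return index
    return None
```

If this returns `None` for a multi-page document, the page boundaries exist only in the renderer's layout, and no purely data-level approach will find them.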

kjhughes
  • Thank you for your answer. I have looked far and wide but it's clear that what I'm trying is not a supported use case. I have given up on the pure docx approach and I am currently trying to hack my way through a docx->pdf->delete->docx pipeline. It's abhorrently slow but it is what it is. – AmanKP Apr 02 '23 at 13:06

Okay, so with some help from LibreOffice forum members I have a solution: a macro. It's relatively slow, but it is what it is. Note that this deletes everything after the first page, but with some work you can rewrite it to select a particular page or a range of pages.

Note: Warning to future readers: It's good if this approximation works for you, but you should realize that there's no guarantee that LibreOffice's pagination algorithm will match Microsoft Word's, so users who use Word may see different deletions. As such, you probably don't want to use this in a production pipeline, and for one-offs, you might be better off using Word Automation to get results closer to what most users would see as a "page". Bottom line: Any design dependent upon DOCX "pages" at the document data level alone is intrinsically flawed. – user @kjhughes

Macro:

  Sub del(start, end_, subdir)
    Dim doc, cursor, i
    ' Load each document hidden so that no window is shown
    Dim props(0) As New com.sun.star.beans.PropertyValue
    props(0).Name = "Hidden"
    props(0).Value = True

    For i = start To end_ - 1
      doc = StarDesktop.loadComponentFromURL("file:///path_to_your_document_folder/" & subdir & "/doc" & i & ".docx", "_default", 0, props)
      cursor = doc.CurrentController.getViewCursor()
      cursor.gotoStart(False)
      ' If a second page exists, select from its start to the end of the document and delete the selection
      If cursor.jumpToNextPage() Then
        cursor.gotoEnd(True)
        cursor.setString("")
      End If

      ' Overwrite the file in place, then close it
      doc.store()
      doc.close(True)
    Next i
  End Sub

The soffice command, invoked from Python:

    import subprocess as sp
    import time

    # Replace <subdir_name> with the actual subdirectory name.
    clip_cmd = ('soffice --nologo --nofirststartwizard --norestore'
                ' "macro:///Standard.Module1.del(0, 1000, <subdir_name>)"')

    a = time.time()
    print("clipping subdir <subdir_name>.")
    sp.call(clip_cmd, shell=True, stdout=sp.DEVNULL)  # discard soffice's stdout
    print(f"This batch took {time.time() - a} seconds.")

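Of course, make sure the del macro is saved in your LibreOffice user profile, in Module1 of the Standard library, so that the macro:///Standard.Module1.del URL resolves.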
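For several subdirectories, the call can be wrapped in a small loop. This is only a sketch: `build_clip_cmd` and `clip_subdirs` are hypothetical helper names, the flags and macro URL format are copied from the command above, and the defaults `0` and `1000` are just the values used there. Passing the arguments as a list avoids `shell=True` and the quoting that goes with it.

```python
import subprocess
import time


def build_clip_cmd(start, end, subdir):
    """Build the soffice argument list for one subdirectory.

    The macro URL mirrors the format used above; `subdir` is inserted as-is.
    """
    macro_url = f"macro:///Standard.Module1.del({start}, {end}, {subdir})"
    return ["soffice", "--nologo", "--nofirststartwizard", "--norestore", macro_url]


def clip_subdirs(subdirs, start=0, end=1000):
    """Run the macro once per subdirectory and report how long each batch took."""
    for subdir in subdirs:
        began = time.time()
        print(f"clipping subdir {subdir}.")
        subprocess.call(build_clip_cmd(start, end, subdir), stdout=subprocess.DEVNULL)
        print(f"This batch took {time.time() - began} seconds.")
```

Usage would look like `clip_subdirs(["batch_a", "batch_b"])`, where the directory names are placeholders for whatever folders hold your documents.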

AmanKP
  • The very question is wrong. `docx` is a document format, just like HTML. How many "pages" does this question have? When you view it on your phone? If you print to a different printer you get different pages. On the other hand, if you have a title page because, e.g., there's an explicitly entered page separator, you *can* delete everything up to that separator. If the file has sections, you can delete the first section. – Panagiotis Kanavos May 18 '23 at 11:38
  • `docx` is a ZIP file containing XML files with a well defined format. Worst case, you can open the relevant XML and delete the XML element you want. It's far easier to use a library though to delete a section – Panagiotis Kanavos May 18 '23 at 11:40
  • @PanagiotisKanavos every bit of information about the indents, page w and h, etc. are all included in the xml of a correctly prepared docx. Knowing these, a never-before-opened docx will still generate the exact same page breaks universally. Which is why when you do jumpToNextPage it calculates correctly the cursor location for the next page's content. At any rate since there isn't any "correct" way to achieve this given the lack of true page-ness of docx files, an approximation like this is as good as you'll get. And personally this has performed correctly for over a million docxs for me. – AmanKP May 18 '23 at 13:45
  • @KJ what I said will hold at least for measurements in pts. I haven't considered whether changing units would have the effect of preserving the quantity. If the page_height is 12000 isn't that standardized pts? End of the day, as I said in my previous reply this was a lossless approximation to inherent pagination in docx where there doesn't exist any. Whatever your own os would render as a page translates directly to what this macro believes is a page. – AmanKP May 19 '23 at 03:11
  • **Warning to future readers:** Good if this *approximation* works for you, but you should realize that there's no guarantee that LibreOffice's pagination algorithm will match that of Microsoft Word's, so users who use Word may see different deletions. As such, you probably don't want to use this in a production pipeline, and for one-offs, you might be better off using Word Automation to get results closer to what most users would be seeing as a "page". ***Bottom line: Any design dependent upon DOCX "pages" at the document data level alone is intrinsically flawed.*** – kjhughes May 19 '23 at 13:33
  • @kjhughes I have updated the answer with your warning, thank you. – AmanKP May 19 '23 at 14:10