I am trying to convert a .docx file to PDF from an ASP.NET MVC application. Until now I have been using the Microsoft Office interop `SaveAs` command, but it sometimes (not always) fails with the error "command failed". I have read that it is deprecated and no longer supported by Microsoft, and that Microsoft does not recommend using it from an ASP.NET application, so I am looking for alternatives.

I have seen that there is a good one, Aspose.Words, but it is not free, and I am interested in a free one. Is there currently any free alternative that is compatible with Microsoft .docx documents and can convert them to PDF without problems?

Willy
  • This isn't a question about the Visual Studio application, so I've removed the `[visual-studio-2013]` tag. If you're limited to a specific version of the .NET Framework, please tag that version instead (since this relates to code, whereas the VS version doesn't). – ProgrammingLlama Oct 21 '20 at 07:27
  • Your real problem is `PDF` not `docx`. `docx` is a ZIP package containing XML files in a well-defined format. It's PDF that's the real problem, as it's essentially a container for print commands (PostScript), not a document format. Even if you wanted to convert HTML to PDF you'd run into trouble – Panagiotis Kanavos Oct 21 '20 at 07:56
  • @PanagiotisKanavos HTML-to-PDF these days requires firing up an entire instance of Chrome in headless mode using Puppeteer; fortunately, "printing" to PDF is straightforward once you get to that part: https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/ – Dai Oct 21 '20 at 08:07
  • @Dai or using a service in Java to make the conversion, like iText ... which is AGPL - oops. Or `pandoc` and `Process.Start`, which doesn't scale well for enterprise apps: the for-profit kind of apps that should pay the $500 license in the first place ... – Panagiotis Kanavos Oct 21 '20 at 08:10

1 Answer

> I am interested in a free one

There isn't one. Office/Word's .docx file format is incredibly long and complicated (see below), so writing a program that can fully parse a Word document is a mammoth undertaking on its own, let alone the equally important tasks of generating a visual-formatting model from it and then converting that visual model to a PDF file by generating PostScript/PDF commands.

This is what the OOXML specification looks like when it's printed out:

[Image: the printed OOXML specification]

(Source: https://fussnotes.typepad.com/plexnex/2007/05/ooxml_more_than_1.html )

Then consider all the features and edge-cases present in the Word formatting model: tables, headings, drop-caps, captions, (don't forget embedded and external content using OLE!), floating textboxes, WordArt, and so on.

Non-visual processing of the XML representation of a Word document is actually trivial and can be done with any XML library, though you should use an OOXML-schema-aware library so that you process the Word document correctly (so you don't end up inserting a paragraph into a header, or a caption that fills the page).
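As a rough illustration of that "non-visual" side, here is a minimal Python sketch that opens a .docx-style ZIP package and pulls the plain text out of each paragraph using only the standard library. The helper names (`paragraph_texts`, `make_sample_package`) are made up for this example, the sample package is a stripped-down stand-in rather than a file Word would open, and the code assumes the main part lives at the conventional `word/document.xml` path instead of resolving it through the package relationships. It says nothing about layout, which is the hard part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(docx_bytes: bytes) -> list:
    """Return the plain text of every <w:p> paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Assumption: the main part is at the conventional path.
        root = ET.fromstring(package.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over <w:t> elements inside its runs.
        texts.append("".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")))
    return texts


def make_sample_package() -> bytes:
    """Build a stand-in package holding only a tiny main part (hypothetical sample data)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    return buffer.getvalue()


print(paragraph_texts(make_sample_package()))
```

Reading the text is the easy bit; nothing here knows about styles, numbering, tables, fields or page geometry, which is where a converter actually spends its effort.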

Everything else is the difficult (and expensive) part of the problem. This is why, even today, almost 40 years after Word was first released and 15 years after the OOXML format specification was released, third-party software like OpenOffice (nee StarOffice) and Apple iWork still cannot fully and correctly import or render Word documents.

Dai
  • ... and I suppose paid alternatives do not guarantee a fully compatible conversion to PDF either, right? – Willy Oct 21 '20 at 07:30
  • @Ralph What do you mean by "fully compatible"? – Dai Oct 21 '20 at 07:31
  • The expensive part is *PDF*, not `docx`. There's no good, free PDF library. `docx` is a *lot* easier than PDF, which is essentially a print language. `docx` is a ZIP file containing well defined XML files. PDF doesn't even have tables on the other hand. You can read a `docx` file with the OpenXML SDK if you want. There's not much interest in generating `docx` files though which is why there are no or very few libraries to make this easier, the way EPPlus, ClosedXML or NPOI do for `xlsx`. Which follows the same format – Panagiotis Kanavos Oct 21 '20 at 07:37
  • I mean that the docx document may contain some type of object that a third-party library cannot handle, so the document cannot be converted to PDF successfully. – Willy Oct 21 '20 at 07:37
  • @Ralph it's the other way around. You can read a `docx` just fine with the OpenXML SDK. Word processing is not Excel sheets though and contains a *lot* of different object types like paragraphs, runs, characters, styles etc. The problem is *PDF*. You can use `iTextSharp` up to a point but PDF is essentially a print language (PostScript), not a document format. It has no tables. Have you tried selecting table rows in a PDF viewer? Noticed how the selection may go along columns instead of rows? Or how selecting text can select unrelated paragraphs? There are no paragraphs either – Panagiotis Kanavos Oct 21 '20 at 07:40
  • Even if you wanted to convert HTML to PDF you'd have the *same* problem - how to generate the PDF content from HTML? Check this [related question](https://stackoverflow.com/questions/10641667/use-of-xsl-fo-css3-instead-of-css2-to-create-paginated-documents-like-pdf/21345708#21345708) to see that in 2020 there's still not a good answer – Panagiotis Kanavos Oct 21 '20 at 07:43
  • @PanagiotisKanavos ok, perfect explanation. Many thanks for the detailed explanation. Great! – Willy Oct 21 '20 at 07:53
  • @PanagiotisKanavos I disagree that the PDF part is the hard part - I believe the hardest part is creating a model of the visual/printed representation of a Word document, because doing that involves *having to reimplement all of Word's formatting model* (right down to rasterizing WordArt). Once you have a model of a visual representation of a document then converting that to PDF (or XPS, or a raster format like BMP or PNG) is straightforward. – Dai Oct 21 '20 at 08:11
  • @Dai using which .NET library? That doesn't require a license? – Panagiotis Kanavos Oct 21 '20 at 08:12
  • @PanagiotisKanavos You wouldn't need one: the PDF specification is simple enough to implement using only a `BinaryWriter`: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf - people buy those PDF libraries for the higher-level functionality (namely converting from one representation to another). If you have your own visual-formatting/layout model of a Word document then writing out the necessary PS drawing commands to create a PDF file is straightforward. – Dai Oct 21 '20 at 08:16
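To make the last comment concrete, here is a minimal Python sketch (not the C# `BinaryWriter` code the comment has in mind) that writes a one-page PDF by hand: five objects, a cross-reference table with byte offsets, and a trailer. The function name `build_minimal_pdf` is made up for this example, and the code assumes Latin-1 text, a US Letter page and the built-in Helvetica font. It covers only the file-writing step; it does nothing to settle where the hard part of the conversion lies.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Return the bytes of a one-page PDF that draws `text` near the top-left corner."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 24pt, move to (72, 720), show the text.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each xref entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello from a hand-written PDF"))
```

Note that the page content is just positioned text: the file has no notion of paragraphs or tables, so every coordinate has to come from a layout engine that runs before this step.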