69

When converting a .docx file to Markdown, the embedded image is not extracted from the docx archive, yet the output contains `![](media/image1.png){width="6.291666666666667in" height="3.1083333333333334in"}`.

Is there a parameter that needs to be set in order to get the embedded pictures extracted?

JC-

3 Answers

119
pandoc --extract-media ./myMediaFolder input.docx -o output.md

From the manual:

--extract-media=DIR: Extract images and other media contained in or linked from the source document to the path DIR, creating it if necessary, and adjust the images references in the document so they point to the extracted files. Media are downloaded, read from the file system, or extracted from a binary container (e.g. docx), as needed. The original file paths are used if they are relative paths not containing `..`. Otherwise filenames are constructed from the SHA1 hash of the contents.
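To confirm that the references in the output really point at extracted files, a small check can help. This is only a minimal sketch and not part of pandoc: `check_image_refs` is a hypothetical helper name, it assumes inline `![alt](path)` image syntax without titles, and it resolves paths relative to the current directory.

```shell
# Hypothetical helper (not a pandoc feature): report image references in a
# Markdown file whose target file does not exist.
# Assumes inline ![alt](path) syntax without titles; paths are resolved
# relative to the current working directory.
check_image_refs() {
  md=$1
  status=0
  # Pull out the "path" part of every ![alt](path) occurrence.
  paths=$(grep -o '!\[[^]]*]([^)]*)' "$md" |
    sed -e 's/^!\[[^]]*](//' -e 's/)$//') || true
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    if [ ! -f "$path" ]; then
      echo "missing: $path" >&2
      status=1
    fi
  done <<EOF
$paths
EOF
  return "$status"
}
```

After running the command above, `check_image_refs output.md` should return 0 if every referenced image was extracted, and print the missing paths otherwise.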

mb21
  • 9
    Thank you for the answer, it works for me. I notice that in Pandoc 2.2.1, the `--extract-media=` option creates a `media` subfolder in the path specified. If you use `--extract-media=./media`, the exported images will be found in the `./media/media` folder. – gridtrak Jun 04 '18 at 16:43
  • 7
    That's because the media files actually live in a folder called "media" inside the docx https://github.com/jgm/pandoc/issues/1986 – Viktor Mar 05 '20 at 08:26
  • Is there a way to extract the media in the order in which they appear in the input file? I mean, I would like the extracted media to have sequential file names that follow their order of appearance in the input file. – Adolfo Correa Oct 08 '20 at 14:14
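One possible workaround for sequential names is to rename the files after conversion, following the order in which they are referenced in the Markdown output. The following is a rough sketch only: `renumber_media` and the `fig-NNN` naming scheme are made up for illustration, and it assumes inline `![alt](path)` references without titles, file names with an extension, paths relative to the current directory, and no existing files already named `fig-NNN.*`.

```shell
# Hypothetical helper: rename referenced images to fig-001.ext, fig-002.ext, ...
# in order of first appearance, and update the references in the Markdown file.
# Assumes no existing fig-NNN.* files in the media folder (no collision check).
renumber_media() {
  md=$1
  n=0
  # Unique image paths, in order of first appearance.
  refs=$(grep -o '!\[[^]]*]([^)]*)' "$md" |
    sed -e 's/^!\[[^]]*](//' -e 's/)$//' |
    awk '!seen[$0]++') || true
  while IFS= read -r old; do
    [ -f "$old" ] || continue
    n=$((n + 1))
    ext=${old##*.}
    new=$(printf '%s/fig-%03d.%s' "$(dirname "$old")" "$n" "$ext")
    mv "$old" "$new"
    # Literal (non-regex) replacement of "(old)" by "(new)" on every line.
    awk -v old="($old)" -v new="($new)" '
      {
        out = ""; line = $0
        while ((i = index(line, old)) > 0) {
          out = out substr(line, 1, i - 1) new
          line = substr(line, i + length(old))
        }
        print out line
      }' "$md" > "$md.tmp" && mv "$md.tmp" "$md"
  done <<EOF
$refs
EOF
}
```

Usage would be something like `renumber_media output.md`, run from the directory the pandoc command was run in.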
25

Referring to the comment by gridtrak and the problem of an unnecessarily deep directory structure (e.g. media/media/image2.jpeg): use the current directory as the path DIR, so that the media folder is created directly within the current directory (e.g. media/image2.jpeg):

pandoc --extract-media=. input.docx -o output.md
sgrubsmyon
  • `media/` is called `Pictures/` in pandoc 2.2.3.2 on Mac. – hobs Jul 12 '19 at 17:57
  • 1
    It would be nice to have an option to rename the sub-folder or to flatten the hierarchy. Also an option to prefix image names with some pattern, to avoid name collisions in case you convert many word documents in the same folder. – Paul Rougieux Jan 14 '21 at 17:19
  • 1
    I found the discussion in a pandoc GitHub repo issue. The temporary workaround is to rename the folder manually using `mv`, e.g. `pandoc my.docx --extract-media=DIR && mv DIR/media DIR/img` – Jiaxiang Aug 06 '21 at 05:47
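Note that renaming the folder with `mv` alone leaves the image references in the Markdown file pointing at the old path. A minimal sketch of doing both steps together is below; `rename_media_dir` is a hypothetical helper, and it assumes the old folder path is passed exactly as it appears in the references (for example `DIR/media`, or `./DIR/media` if that is what pandoc wrote).

```shell
# Hypothetical helper: rename the extracted media folder and rewrite the
# image references in the Markdown file to match.
# old_dir must be spelled exactly as it appears inside the ![](...) targets.
rename_media_dir() {
  md=$1        # Markdown file written by pandoc
  old_dir=$2   # e.g. DIR/media
  new_dir=$3   # e.g. DIR/img
  mv "$old_dir" "$new_dir"
  # Literal (non-regex) replacement of "(old_dir/" by "(new_dir/".
  awk -v old="($old_dir/" -v new="($new_dir/" '
    {
      out = ""; line = $0
      while ((i = index(line, old)) > 0) {
        out = out substr(line, 1, i - 1) new
        line = substr(line, i + length(old))
      }
      print out line
    }' "$md" > "$md.tmp" && mv "$md.tmp" "$md"
}
```

For example, `rename_media_dir output.md DIR/media DIR/img` after the pandoc run. Using a different `DIR` per document is one way to avoid name collisions when converting many Word files into the same folder.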
0

You may try "--embed-resources".
https://pandoc.org/MANUAL.html#option--embed-resources

--embed-resources[=true|false]: Produce a standalone HTML file with no external dependencies, using data: URIs to incorporate the contents of linked scripts, stylesheets, images, and videos. The resulting file should be “self-contained,” in the sense that it needs no external files and no net access to be displayed properly by a browser. This option works only with HTML output formats, including html4, html5, html+lhs, html5+lhs, s5, slidy, slideous, dzslides, and revealjs. Scripts, images, and stylesheets at absolute URLs will be downloaded; those at relative URLs will be sought relative to the working directory (if the first source file is local) or relative to the base URL (if the first source file is remote). Elements with the attribute data-external="1" will be left alone; the documents they link to will not be incorporated in the document. Limitation: resources that are loaded dynamically through JavaScript cannot be incorporated; as a result, fonts may be missing when --mathjax is used, and some advanced features (e.g. zoom or speaker notes) may not work in an offline “self-contained” reveal.js slide show.
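As the quoted passage says, this only applies to HTML output, so it does not give you Markdown with extracted image files. If a single HTML file is acceptable, a sketch of the call might look like this; `docx_to_standalone_html` is a hypothetical wrapper name, and it assumes a pandoc version recent enough to have `--embed-resources`.

```shell
# Hypothetical wrapper: convert a .docx to one self-contained HTML file.
# Assumes the installed pandoc supports --embed-resources.
docx_to_standalone_html() {
  input=$1                       # e.g. input.docx
  output="${input%.docx}.html"   # e.g. input.html
  pandoc --standalone --embed-resources "$input" -o "$output"
}
```

Calling `docx_to_standalone_html input.docx` would then write `input.html` next to the source file.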