Ruby: parse/extract images and objects from docx file

Question

I am trying to open and read a .docx file using Ruby, extract portions of the text and the objects/images, and save them into another (non-.docx) file.

Using Nokogiri, I am able to properly extract text and do my partitioning of the document into the sections I want via:

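require 'zip'
require 'nokogiri'

zip = Zip::File.open(file_path)
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
# Select every text run, binding the "w" prefix to the WordprocessingML namespace
wt  = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})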

If I do instead:

xml.root.xpath("//w:body", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})

I can see the objects in the xml as:

  <w:object w:dxaOrig="1440" w:dyaOrig="400">
    <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
      <v:stroke joinstyle="miter"/>
      <v:formulas>
        <v:f eqn="if lineDrawn pixelLineWidth 0"/>
        <v:f eqn="sum @0 1 0"/>
        <v:f eqn="sum 0 0 @1"/>
        <v:f eqn="prod @2 1 2"/>
        <v:f eqn="prod @3 21600 pixelWidth"/>
        <v:f eqn="prod @3 21600 pixelHeight"/>
        <v:f eqn="sum @0 0 1"/>
        <v:f eqn="prod @6 1 2"/>
        <v:f eqn="prod @7 21600 pixelWidth"/>
        <v:f eqn="sum @8 21600 0"/>
        <v:f eqn="prod @7 21600 pixelHeight"/>
        <v:f eqn="sum @10 21600 0"/>
      </v:formulas>
      <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
      <o:lock v:ext="edit" aspectratio="t"/>
    </v:shapetype>
    <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:1in;height:20.4pt" o:ole="">
      <v:imagedata r:id="rId4" o:title=""/>
    </v:shape>
    <o:OLEObject Type="Embed" ProgID="Equation.DSMT4" ShapeID="_x0000_i1025" DrawAspect="Content" ObjectID="_1563800156" r:id="rId5"/>
  </w:object>

However, I am not sure how to convert that into something that can later be displayed in HTML. Ideally it would be converted to SVG so that it could be displayed alongside the text.

Thanks for any help.

score 0 · Answer 1 · answered Sep 18 '17 at 19:26

It looks like that might be VML, judging by how it compares with the example from "Using the Formulas Element" on MSDN:

<v:shape style='width:1in;height:1in;' strokecolor="red"
strokeweight="2pt" coordsize="21600,21600" adj="17520"
path="m10800,0qx0,10800,10800,21600,21600,10800,10800,0xe
m7340,6445qx6215,7570,7340,8695,8465,7570,7340,6445xnfe
m14260,6445qx13135,7570,14260,8695,15385,7570,14260,6445xnfe
m4960@0c8853@3,12747@3,16640@0nfe">
  <v:formulas>
    <v:f eqn="sum 33030 0 #0"/>
    <v:f eqn="prod #0 4 3"/>
    <v:f eqn="prod @0 1 3"/>
    <v:f eqn="sum @1 0 @2"/>
  </v:formulas>
</v:shape>

That page links to the VML specification, which explains what each element means. As for pre-written tools, I am not finding much. There are a couple of questions about it, but most of them are marked as duplicates of, or otherwise refer back to, the question "Are there any tools to convert legacy VML to SVG?"
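Separately from converting the VML itself, the `<v:imagedata r:id="rId4"/>` element in the question points at a picture stored inside the package, so it may be enough to pull that file out. Below is a minimal sketch of resolving those ids, assuming the relationships live in `word/_rels/document.xml.rels` and that each `Target` is relative to `word/`. The helper names `relationship_targets` and `image_paths` are hypothetical, and the sketch uses REXML rather than Nokogiri only so that it runs without extra gems; the same XPath expressions should carry over to Nokogiri.

```ruby
require 'rexml/document'

# Namespace URIs used by VML and by the OOXML relationship parts.
V_NS   = 'urn:schemas-microsoft-com:vml'
R_NS   = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
REL_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'

# Hypothetical helper: map each relationship Id (e.g. "rId4") to its Target.
def relationship_targets(rels_xml)
  doc = REXML::Document.new(rels_xml)
  targets = {}
  REXML::XPath.each(doc, '//rel:Relationship', 'rel' => REL_NS) do |rel|
    targets[rel.attributes['Id']] = rel.attributes['Target']
  end
  targets
end

# Hypothetical helper: list the package paths of the images referenced by
# v:imagedata elements. Assumes internal targets relative to "word/".
def image_paths(document_xml, rels_xml)
  targets = relationship_targets(rels_xml)
  doc = REXML::Document.new(document_xml)
  paths = REXML::XPath.match(doc, '//v:imagedata', 'v' => V_NS).map do |node|
    id_attr = node.attributes.get_attribute_ns(R_NS, 'id')
    target  = id_attr && targets[id_attr.value]
    target && File.join('word', target)
  end
  paths.compact
end
```

With the `zip` object from the question, the two XML strings could be read with `zip.find_entry("word/document.xml").get_input_stream.read` and the same call for `word/_rels/document.xml.rels`; each returned path can then be looked up with `zip.find_entry` and written to disk. For OLE objects such as `Equation.DSMT4` the preview image is often a WMF or EMF file, so it would probably still need a separate conversion step before it can be shown in HTML.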
