I am currently parsing docx files in Perl and trying to extract particular lines for later use.

Using Win32::OLE and regular expressions, I'm able to grab these exact lines. However, almost certainly due to an encoding issue, certain information is lost. Take the following line:

XX XXXXX-X:XXXX/XXX:XXXX

In the output txt file, this shows as: XX XXXX�X:XXXX/XXX:XXXX

As you can see, the hyphen has been replaced with an invalid character.

I'm not sure how to prevent this.

Here's how I'm getting all the text from the document:

use 5.010;
use strict;
use warnings;
use utf8;
use Win32::OLE qw(in);
use Win32::OLE::Variant;
use Win32::OLE::Enum;

use Class::CSV;

my $path = 'example.docx';
my $document = Win32::OLE->GetObject($path);
my $outputfile = 'wordoutput.txt';
open(my $fh, '>', $outputfile) or die "couldn't open $outputfile: $!";
binmode($fh, ":utf8");

print "Extracting Text ...\n";

# Walk every paragraph in the document through an OLE enumerator.
my $paragraphs = $document->Paragraphs();
my $enumerate = Win32::OLE::Enum->new($paragraphs);
my $style = '';
my $text = '';
my $overalltext = '';

while (defined(my $paragraph = $enumerate->Next())) {
    $style = $paragraph->{Style}->{NameLocal};
    print $fh $style;
    $overalltext .= $style;
    $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # drop paragraph terminators
    $text =~ s/\x0b/\n/g;    # vertical tab (a manual line break in Word) becomes a newline
    print $fh $text;
    $overalltext .= $text;
}

All the research I've done indicates this is an encoding issue (see any Stack Overflow question regarding �, such as "Why does a diamond with a questionmark in it � appear in my HTML?"), but I'm not sure how to go about fixing it. Thanks in advance for your help.

EDIT: With the help of tripleee (see comments), I've taken a look at the bytes. The issue in this specific case seems to be a non-breaking hyphen (and, earlier in the same text, a non-breaking space). I'm still not sure how to resolve this.

Yvain
  • What are the actual _bytes_ in the input? – tripleee Mar 14 '22 at 12:43
  • @tripleee I'm sorry, I don't understand what you mean by this. The input is a docx file; could you elaborate so I can give you what you need? – Yvain Mar 14 '22 at 12:46
  • Text is encoded as bytes according to a specific encoding; perhaps see [the Stack Overflow `character-encoding` tag info page](/tags/character-encoding/info) for some basics. – tripleee Mar 14 '22 at 12:49
  • @tripleee Well, my understanding is, since docx is just compressed xml, I've opened up the document.xml and I see encoding="UTF-8", so I presume this is what you're looking for? – Yvain Mar 14 '22 at 12:57
  • No, I am looking for the actual bytes in the string. Something like `$sample = $text; $sample =~ s/./ sprintf("0x%02x", ord($&)) /ges; print($sample)` (there's probably a more elegant way to do it; my Perl is rusty) for the problematic string. – tripleee Mar 14 '22 at 13:27
  • @tripleee Do you need to see the entire string or only the bytes where I'm getting the above issue? – Yvain Mar 14 '22 at 13:43
  • @tripleee bytes: https://pastebin.com/xyUwFFta – Yvain Mar 14 '22 at 13:48
  • Your sample decodes to `EN\xa060825\x1e1:20142), Safety of laser products - Part 1: Equipment classification and requirements` where `\xa0` and `\x1e` are non-printable characters. What characters do you actually expect in their place? – tripleee Mar 14 '22 at 14:05
  • https://tripleee.github.io/8bit/#a0 doesn't immediately suggest any useful standard 8-bit encoding, and 0x1E is an obscure ASCII control character; I guess Word might have its own conventions from the most desperate depths of the Microsoft abyss. – tripleee Mar 14 '22 at 14:06
  • @tripleee \xa0 should be a space and \x1e should be a hyphen. I had found something somewhere that suggests the issue is the first byte is mapped to a nonbreaking space and the second to a nonbreaking hyphen. – Yvain Mar 14 '22 at 14:09
  • All I can suggest based on the available evidence is to manually add a mapping for each of these problematic characters to their corresponding Unicode equivalent as you discover them, like you already haphazardly appear to be doing for `\x0b`. Eventually, maybe you can pivot to a less braindead document format. – tripleee Mar 14 '22 at 14:13
  • `\xa0` is _printable_ (U+00A0, *No-Break Space*)… – JosefZ Mar 14 '22 at 20:07
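
The manual mapping suggested in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `%word_char_map` and `normalize_word_text` are made up for this example, and the choice of replacements (`\x1e` to a plain hyphen, `\xa0` to a plain space) simply follows what the asker said they expect in those positions. Other Word-specific characters may exist and would need to be added as they are found.

```perl
use strict;
use warnings;

# Hypothetical mapping table: code points seen in the byte dump on the left,
# the plain characters the asker expects on the right. Extend as needed.
my %word_char_map = (
    "\x{1E}" => '-',     # reported in the comments as a non-breaking hyphen
    "\x{A0}" => ' ',     # U+00A0 NO-BREAK SPACE
    "\x{0B}" => "\n",    # vertical tab, already mapped to a newline in the question
);

# Build a character class such as [\x{1E}\x{A0}\x{B}] from the table keys,
# so a single substitution pass covers every mapped character.
my $class = join '', map { sprintf '\\x{%X}', ord } keys %word_char_map;

sub normalize_word_text {
    my ($text) = @_;
    $text =~ s/([$class])/$word_char_map{$1}/g;
    return $text;
}

# Sample taken from the decoded string in the comments.
print normalize_word_text("EN\x{A0}60825\x{1E}1"), "\n";    # prints "EN 60825-1"
```

In the question's loop, this function would presumably replace the two separate `s///` lines that follow the `[\n\r]` removal, so that `$text` is normalized once before it is printed.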
