I am currently parsing .docx files in Perl, trying to extract particular lines for later use.
Using Win32::OLE and regular expressions, I'm able to grab these exact lines. However, almost certainly due to an encoding issue, certain information is lost. Take the following line:
XX XXXXX-X:XXXX/XXX:XXXX
In the output txt file, this shows as: XX XXXX�X:XXXX/XXX:XXXX
As you can see, the hyphen has been replaced with an invalid character.
I'm not sure how to prevent this.
Here's how I'm getting all the text from the document:
use 5.010;
use strict;
use warnings;
use utf8;
use Win32::OLE qw(in);
use Win32::OLE::Variant;
use Win32::OLE::Enum;
use Class::CSV;
my $path = 'example.docx';
my $document = Win32::OLE->GetObject($path);
my $outputfile = 'wordoutput.txt';
open(my $fh, '>', $outputfile) or die "Couldn't open $outputfile: $!";
binmode($fh, ":encoding(UTF-8)");
print "Extracting Text ...\n";
my $paragraph = $document->Paragraphs();
my $enumerate = Win32::OLE::Enum->new($paragraph);
my $style = '';
my $text = '';
my $overalltext = '';
while (defined($paragraph = $enumerate->Next()))
{
    # Write the paragraph's style name, then its text
    $style = $paragraph->{Style}->{NameLocal};
    print $fh "$style";
    $overalltext = $overalltext . "$style";
    $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # drop paragraph terminators
    $text =~ s/\x0b/\n/g;    # turn vertical tabs into newlines
    print $fh "$text";
    $overalltext = $overalltext . "$text";
}
All the research I've done indicates this is an encoding issue (see any Stack Overflow question regarding �, such as "Why does a diamond with a questionmark in it � appear in my HTML?"), but I'm not sure how to go about fixing it. Thanks in advance for your help.
EDIT: With the help of tripleee (see comments), I've taken a look at the bytes. In this case the problem character is specifically a non-breaking hyphen (and, earlier in the same line, a non-breaking space). I'm still not exactly sure how to resolve this.
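For reference, this is roughly how the offending characters can be listed. It's only a minimal sketch: describe_chars is a hypothetical helper name, and the sample string is made up, with U+00A0 (no-break space) and U+2011 (non-breaking hyphen) assumed as stand-ins for whatever the document really contains.

```perl
use strict;
use warnings;

# Hypothetical helper: list every character outside printable ASCII
# (0x20-0x7E) as a U+XXXX code point, in order of appearance.
sub describe_chars {
    my ($text) = @_;
    return join ' ',
        map  { sprintf 'U+%04X', ord($_) }
        grep { ord($_) < 0x20 || ord($_) > 0x7E }
        split //, $text;
}

# Made-up sample standing in for one extracted line (an assumption,
# not the real document text).
my $sample = "XX\x{A0}XXXXX\x{2011}X:XXXX/XXX:XXXX";
print describe_chars($sample), "\n";
```

Running the same helper on $text inside the loop would show whether the special characters are still intact when they come back from Word, or whether they have already been replaced before the file is written.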