
I need to fetch tabular data from an MS Word file. The code I adapted (below) appears to fetch only the first and last rows, but I need the entire table.

Later, the fetched data has to be cross-checked to see whether a file with the same name exists in a folder.

I cannot even follow the flow of the code, as I am new to the Win32::OLE module.

I have read a similar question about fetching table data on this site, but I could not follow it.

Please let me know how to proceed.

#!/usr/bin/perl 

use strict;
use warnings;

use File::Spec::Functions qw( catfile );
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';

$Win32::OLE::Warn = 3;

my $word = get_word();
$word->{DisplayAlerts} = wdAlertsNone;
$word->{Visible}       = 1;

my $doc    = $word->{Documents}->Open('D:\A.doc');
my $tables = $word->ActiveDocument->{'Tables'};

for my $table (in $tables) {
  my $tableText = $table->ConvertToText({ Separator => wdSeparateByTabs });
  print "Table: " . $tableText->Text() . "\n";
}

$doc->Close(0);

sub get_word {
  my $word;
  eval { $word = Win32::OLE->GetActiveObject('Word.Application'); };
  die "$@\n" if $@;
  unless (defined $word) {
    $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
        or die "Oops, cannot start Word: ", Win32::OLE->LastError, "\n";
  }
  return $word;
}

UPDATE: A.doc

Article    No.      Count No      Committee
 A0029     A0029    16            E01.07
 B0028     B0028    34            E04.09
 C0036     C0036    17            E09.00
 D0033     D0033    15            E08.07

Output in CMD

D:\Word>A.pl
D0033   D0033   15      E08.07No                     Committee
flora
  • You need to refer to the [Word Developer Reference](https://msdn.microsoft.com/en-us/library/office/ff841702%28v=office.14%29.aspx) which lists all of the classes that are available together with their properties and methods. But I can't see why your code should be fetching only the first and last rows. The `Table.ConvertToText` method returns a `Range` object, which is essentially just the character positions of the beginning and end of the table. And the `Range.Text` property is just the text between those two positions, so not much can go wrong. Are you viewing this on a console? – Borodin Apr 29 '15 at 10:45
  • Are you able to publish your Word document so that we can try it out ourselves? – Borodin Apr 29 '15 at 10:46
  • @Borodin: I am viewing it on the console (cmd) and I run the program as D:\>A.pl. I tried the code on various doc files, some with many rows and some with only two, but I get the same output. – flora Apr 29 '15 at 11:18
  • Have updated the sample of A.doc file and also the output generated. @sinan : The code is referred from [link](http://stackoverflow.com/questions/13185835/read-ms-word-table-data-row-wise-using-win32ole-perl) – flora Apr 29 '15 at 11:54
  • @sinan:I just wanted to give out the link from where I referred the code and had no intention of not giving you any credit. I owe you guys who are out there to help beginners learn. I suppose you would help me out to get out of this error in the code and understand it.. – flora Apr 29 '15 at 18:42

1 Answer

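The problem is that rows in the text returned by the ConvertToText method are terminated with CR characters, not newlines. On the console, each CR moves the cursor back to the start of the line, so every row overwrites the previous one and only fragments of the first and last rows remain visible: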

C:\...\Temp> perl word-table.pl A.doc | xxd
00000000: 4172 7469 636c 6509 4e6f 2e09 436f 756e  Article.No..Coun
00000010: 7420 4e6f 0943 6f6d 6d69 7474 6565 0d41  t No.Committee.A
00000020: 3030 3239 0941 3030 3239 0931 3609 4530  0029.A0029.16.E0
00000030: 312e 3037 0d42 3030 3238 0942 3030 3238  1.07.B0028.B0028
00000040: 0933 3409 4530 342e 3039 0d43 3030 3336  .34.E04.09.C0036
00000050: 0943 3030 3336 0931 3709 4530 392e 3030  .C0036.17.E09.00
00000060: 0d44 3030 3333 0944 3030 3333 0931 3509  .D0033.D0033.15.
00000070: 4530 382e 3037 0d0d 0a                   E08.07...
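
The same effect can be reproduced without Word. The following is a minimal sketch, assuming a hand-built string shaped like the dump above (tabs between cells, a bare CR after each row); the variable names are illustrative only:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hand-built stand-in for the text returned by ConvertToText:
# cells separated by tabs, every row terminated by a bare CR.
my $text = join "\r",
    "Article\tNo.\tCount No\tCommittee",
    "A0029\tA0029\t16\tE01.07",
    "B0028\tB0028\t34\tE04.09",
    "C0036\tC0036\t17\tE09.00",
    "D0033\tD0033\t15\tE08.07";
$text .= "\r";

# Count the row terminators: there are CRs but no LFs at all.
my $cr_count = () = $text =~ /\r/g;
my $lf_count = () = $text =~ /\n/g;
print "CR: $cr_count, LF: $lf_count\n";    # prints "CR: 5, LF: 0"

# Splitting on CR recovers every row, so no data is actually lost.
my @rows = split /\r/, $text;
printf "%d rows\n", scalar @rows;          # prints "5 rows"
```

All five rows are present in the string; only the way the console renders CR makes them look missing.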

To fix this, replace the carriage returns with newlines before printing:

#!/usr/bin/env perl

use strict;
use warnings;

use Carp qw( croak );
use Cwd qw( abs_path );
use Path::Class;
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';

$Win32::OLE::Warn = 3;

run(\@ARGV);

sub run {
    my $argv = shift;
    my $word = get_word();

    $word->{DisplayAlerts} = wdAlertsNone;
    $word->{Visible}       = 1;

    for my $word_file ( @$argv ) {
        print_tables($word, $word_file);
    }

    return;
}

sub print_tables {
    my $word = shift;
    my $word_file = file(abs_path(shift));

    my $doc = $word->{Documents}->Open("$word_file");
    my $tables = $word->ActiveDocument->{Tables};

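    for my $table (in $tables) {
        # ConvertToText returns a Range; its Text ends each row with a CR.
        my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
        # Turn the CR row terminators into newlines so every row is shown.
        $text =~ s/\r/\n/g;
        print $text, "\n";
    }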

    $doc->Close(0);
    return;
}

sub get_word {
    my $word;
    eval { $word = Win32::OLE->GetActiveObject('Word.Application'); 1 }
        or die "$@\n";
    $word and return $word;
    $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
        or die "Oops, cannot start Word: ", Win32::OLE->LastError, "\n";
    return $word;
}
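
For the second part of the question (checking whether a file with the same name exists in a folder), here is a rough sketch rather than a definitive solution. It assumes that the first column holds the name to look for, that the file extension does not matter, and that names compare case-insensitively (as on Windows). The helpers `parse_table` and `existing_basenames` are hypothetical names introduced for this example, and the temporary directory with two `.pdf` files only stands in for the real folder:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use File::Basename qw( fileparse );
use File::Spec::Functions qw( catfile );
use File::Temp qw( tempdir );

# Split the text from ConvertToText into rows of trimmed cells.
# Rows may end in CR, CRLF or LF; cells are separated by tabs.
sub parse_table {
    my $text = shift;
    my @rows;
    for my $line (split /\r\n?|\n/, $text) {
        next unless length $line;
        my @cells = split /\t/, $line;
        s/^\s+|\s+$//g for @cells;
        push @rows, \@cells;
    }
    return @rows;
}

# Collect the lower-cased base names (extension removed) of the
# plain files in a directory.
sub existing_basenames {
    my $dir = shift;
    opendir my $dh, $dir or die "Cannot open '$dir': $!\n";
    my %seen;
    for my $entry (readdir $dh) {
        next unless -f catfile($dir, $entry);
        my ($base) = fileparse($entry, qr/\.[^.]*/);
        $seen{ lc $base } = 1;
    }
    closedir $dh;
    return \%seen;
}

# Demo data: the sample table, plus a scratch folder with two files.
my $text = "Article\tNo.\tCount No\tCommittee\r"
         . "A0029\tA0029\t16\tE01.07\r"
         . "B0028\tB0028\t34\tE04.09\r"
         . "C0036\tC0036\t17\tE09.00\r"
         . "D0033\tD0033\t15\tE08.07\r";

my $dir = tempdir( CLEANUP => 1 );
for my $name (qw( A0029.pdf C0036.pdf )) {
    open my $fh, '>', catfile($dir, $name) or die "Cannot create $name: $!\n";
    close $fh;
}

my ($header, @rows) = parse_table($text);
my $have = existing_basenames($dir);

my @missing;
for my $row (@rows) {
    my $article = $row->[0];    # assumption: first column is the file name
    my $found   = $have->{ lc $article };
    push @missing, $article unless $found;
    printf "%-8s %s\n", $article, $found ? 'found' : 'MISSING';
}
```

In a real script, `$text` would come from `$table->ConvertToText(wdSeparateByTabs)->Text` and `$dir` would be the folder to check.
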
Sinan Ünür