Extract table perl

Question

I am starting to learn Perl because it is very useful for my research. I cannot figure out how to extract a table from a text file.

I have a folder with a certain number of text files named sequentially like this:

1.txt
2.txt
3.txt
...
...
1000.txt

An example of these files in .txt format can be found at the following link: http://www.sec.gov/Archives/edgar/data/1750/000104746909008102/0001047469-09-008102.txt

The .htm version of the same file can be found at the following link: http://www.sec.gov/Archives/edgar/data/1750/000104746909008102/a2194264zdef14a.htm

The table I am looking for in these files is sometimes called:

Non-Qualified Deferred Compensation Table

and sometimes appears with small variations, such as:

Non Qualified Deferred Compensation Table

Basically, this table has these words in its headers (they might vary slightly from file to file):

"Contributions"
"Aggregate Earnings"
"Aggregate Withdrawal/Distributions"

and other headers. The wording varies slightly from file to file, but these words appear in pretty much every "Deferred Compensation Table" in my .txt files (for an example, search for "Non-Qualified Deferred Compensation Table" in the .htm and .txt files linked above). Under these headers there are dollar amounts for a certain number of managers (the number of table rows varies from file to file).

Is there a way to write a Perl script that extracts the deferred compensation table from each file and produces a single .csv file containing all of the tables (headers and the numbers below them), with each row tagged with the .txt file it came from?

Something like this in the output file:

File    Manager Name    Contributions    Aggregate Earnings    Aggregate Withdrawal/Distributions
1.txt   Manager1        00000            00000                 00000
1.txt   Manager2        00000            00000                 00000
1.txt   Manager3        00000            00000                 00000
2.txt   Manager1        00000            00000                 00000
2.txt   Manager2        00000            00000                 00000
2.txt   Manager3        00000            00000                 00000
3.txt   Manager1        00000            00000                 00000
3.txt   Manager2        00000            00000                 00000
3.txt   Manager3        00000            00000                 00000

I would be most grateful if you could help me with this. I am new and I am trying to learn Perl, but this specific task seems honestly very hard for me.

FYI: Do not use RegEx. http://stackoverflow.com/a/1732454/1791055 – titanofold Nov 02 '12 at 01:00
Hi Sputnik, I am doing this research to understand how executive compensation works. Would you be willing to help me with code to capture the tables above? I am really stuck on this. Thank you, Stefano – user1792877 Nov 02 '12 at 16:30

score 1 · Answer 1 · edited May 23 '17 at 12:10

Perl can achieve this easily.

You should take a look at these Perl modules:

You will find tons of web-scraping examples here or on http://google.com
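To give an idea of what such a script could look like, here is a minimal sketch. It assumes the CPAN module HTML::TableExtract is installed; that module is my own choice for illustration and is not one named in this thread. It also assumes the table sits in HTML markup inside each file and that its header row contains the words "Name", "Contributions", "Earnings" and "Withdrawal". The real filings vary, so the header patterns will probably need tuning, and `extract_rows` and `csv_line` are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Header patterns used to recognise the table. These are assumptions
# based on the headers quoted in the question; adjust them as needed.
my @HEADERS = ('Name', 'Contributions', 'Earnings', 'Withdrawal');

# Return the data rows (as array refs) of every table whose header row
# matches all of the patterns in @HEADERS.
sub extract_rows {
    my ($html) = @_;
    my $te = HTML::TableExtract->new(headers => \@HEADERS);
    $te->parse($html);
    my @rows;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            # Empty cells come back undefined; turn them into '' and
            # trim surrounding whitespace from the rest.
            my @cells = map {
                my $c = defined $_ ? $_ : '';
                $c =~ s/^\s+|\s+$//g;
                $c;
            } @$row;
            push @rows, \@cells;
        }
    }
    return @rows;
}

# Build one CSV line. Every field is quoted because dollar amounts
# such as "12,345" contain commas.
sub csv_line {
    my @fields = @_;
    return join(',', map { my $f = $_; $f =~ s/"/""/g; qq("$f") } @fields) . "\n";
}

# Print the CSV header, then one line per table row for each file
# given on the command line.
print csv_line('File', 'Manager Name', 'Contributions',
               'Aggregate Earnings', 'Aggregate Withdrawal/Distributions');
for my $file (@ARGV) {
    my $fh;
    unless (open $fh, '<', $file) {
        warn "Cannot open $file: $!\n";
        next;
    }
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print csv_line($file, @$_) for extract_rows($html);
}
```

Assuming the script is saved as extract.pl (a hypothetical name), it could be run as `perl extract.pl 1.txt 2.txt 3.txt > tables.csv`. Files in which no table matches all four header patterns simply contribute no rows, so it is worth checking which files are missing from the output.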

answered Nov 02 '12 at 00:52

Gilles Quénot

My favorite is to use Mojo::DOM. http://search.cpan.org/~sri/Mojolicious/lib/Mojo/DOM.pm – titanofold Nov 02 '12 at 00:58
XML::XPath has some issues and hasn't been updated since 2003. I'd recommend [XML::LibXML](https://metacpan.org/module/XML::LibXML) for doing XPath queries. (It also supports other kinds of XML parsing.) – friedo Nov 02 '12 at 04:35
Hello, thank you very much for your reply. What if I try to extract the tables from the .txt files instead? Would it be easier to come up with a Perl script for that? – user1792877 Nov 02 '12 at 09:04
In this case, there is no need for WWW::Mechanize. – Gilles Quénot Nov 02 '12 at 15:23
Hi there. I have tried everything I could think of. It is really beyond my capabilities. I would be willing to pay someone to do this job for me. Can anyone suggest a serious, professional Perl expert? Thank you very much, Stefano – user1792877 Nov 03 '12 at 13:00
