I developed a custom system that simulates web activity, such as downloading files, along with a custom file format that feeds into it. I am replacing this old system, which is written in Perl, with a newer one in Python. But first I have to parse the old files somehow.

There are certain sections in the file that I would like to parse, such as [settings], which holds the arguments for the system. There is also a [macro] section, which is where the important stuff (the steps, etc.) begins.

What I am having trouble with is parsing these sections and having my system write them out in a different, much simpler format. I have thousands of these files, so I just want to write a generator that takes an old file and writes the new format to a new file.

Old format:

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0

[macro]
%::WebSurfRules =
    (
    'Step1' =>
        {
        action                  => 'NAVIGATE',
        inputstring             => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
        },
    'Step2' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'FirstClick\'\}',
        pass          => 'phHttpDest->\{\'Step2Pass\'\}',
        },
    'Step3' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'SecondClick\'\}',
        },
    'Step4' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'DealClick\'\}',
        accept_multi_match  => 'ANY_TOP_FIRST',
        },
    'Step5' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'LinkClick2\'\}',
        fail          => 'Step6',
    #    accept_multi_match  => 'ANY_TOP_LAST',
        },
    'Step6' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'DocClick\'\}',
        },
    'Step7' =>
        {
        action                  => 'CLICK_DOWNLOAD_OK',
        },
    );

[data]
Print WebAddress______________  Destination_________________________________________________ FirstClick_________________ SecondClick________________    DealClick_________________________   LinkClick2________________________  DocClick___________________________________ PayInterval   DueDay  Step2Pass__________     QaRule_________________________________________________________________________________________________________________
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf              Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Fund´s Allocation                           q1                    Step3                   qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}

And what I want it to spit out:

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
    (
    '1'     => 'NAVIGATE,phHttpDest->\{\'WebAddress\'\}', 
    '2'     => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',                                                         
    '3'     => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',                                 
    '4'     => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
    '5'     => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',                     
    '6'     => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',           
    );

[data]
Print WebAddress______________  Destination_________________________________________________ FirstClick_________________ SecondClick________________    DealClick_________________________   LinkClick2________________________  DocClick___________________________________ PayInterval   DueDay  Step2Pass__________     QaRule_________________________________________________________________________________________________________________
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf              Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Fund´s Allocation                           q1                    Step3                   qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}

Here, each step's action and phHttpDest key correspond to the headings of the [data] section.

  • My first thought would be, can't I use the existing system in perl to process these files, then write them out into a simpler format? – Jon Clements Nov 23 '12 at 16:30
  • The thing is that I wrote a new system in Python and am trashing the one in Perl, so it would be nice to build some sort of generator or parser in Python that converts the files for the new system. –  Nov 23 '12 at 16:38

1 Answer

One way of doing it is to apply a series of regular expression replacements that rewrite the file into the new format. I didn't completely understand the rules of your format, so this implements most of it, but the result differs from what you asked for in a few places: the NAVIGATE step keeps its literal URL, Step7 is kept, and some lines end with an extra quote and comma. You'll have to go in and make some adjustments to fine-tune it. The output.txt file below is what gets produced when your example is used as input.txt.

code

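import re

with open('input.txt') as src:
    data = src.read()

# Collapse "'StepN' => { action => " into "'N'     => ".
data = re.sub(r"    'Step([0-9]+)' =>\s+\{\s+action\s+=> ", r"    '\1'     => ", data)
# Drop fields the new format does not carry, along with the quote and comma before them.
data = re.sub(r"',\s+pass\s+[^,]+,", "", data)
data = re.sub(r"',\s+accept_multi_match\s+[^,]+,", "", data)
# Remove commented-out lines.
data = re.sub(r"\n +#.*\n", "\n", data)
data = re.sub(r"',\s+fail\s+[^,]+,", "", data)
data = re.sub(r"',\s+matchtype\s+[^,]+,", "", data)
# Join the action with its inputstring or matchstring value.
data = re.sub(r"',\s+inputstring\s+=> '", ",", data)
data = re.sub(r"\s+matchstring\s+=> '", ",", data)
# Replace the closing brace of each step with a closing quote and comma.
data = re.sub(r"\n        },", "',", data)

with open('output.txt', 'w') as dst:
    dst.write(data)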

output.txt

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0

[macro]
%::WebSurfRules =
    (
    '1'     => 'NAVIGATE,http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',',
    '2'     => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
    '3'     => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',',
    '4'     => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
    '5'     => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
    '6'     => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',',
    '7'     => 'CLICK_DOWNLOAD_OK',',
    );

...
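
If the extra quote and comma on some lines gets in the way, an alternative is to parse each step into a dictionary first and then write the new line from it. The following is only a sketch under assumptions: the names convert_macro, STEP_RE, FIELD_RE and SAMPLE are made up for illustration, every step is assumed to have an action, and the value is assumed to come from matchstring or, failing that, inputstring. It does not replace the URL with a phHttpDest reference or drop Step7, since I don't know the rules for those.

```python
import re

# One "'StepN' => { ... }," block; group 2 is the text between the braces.
STEP_RE = re.compile(r"'Step(\d+)'\s*=>\s*\{(.*?)^\s*\},", re.DOTALL | re.MULTILINE)
# One "key => 'value'," line; commented-out lines start with '#' and do not match.
FIELD_RE = re.compile(r"^[ \t]*(\w+)[ \t]*=>[ \t]*'(.*)',[ \t]*$", re.MULTILINE)


def convert_macro(macro_text):
    """Return the step lines of the new format, one per 'StepN' block."""
    lines = []
    for number, body in STEP_RE.findall(macro_text):
        fields = dict(FIELD_RE.findall(body))
        action = fields["action"]  # assumes every step defines an action
        # Assumption: the value comes from matchstring, else inputstring.
        target = fields.get("matchstring", fields.get("inputstring"))
        value = action if target is None else f"{action},{target}"
        lines.append(f"    '{number}'     => '{value}',")
    return "\n".join(lines)


# A few steps copied from the example file in the question.
SAMPLE = r"""
    'Step1' =>
        {
        action                  => 'NAVIGATE',
        inputstring             => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
        },
    'Step5' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'LinkClick2\'\}',
        fail          => 'Step6',
    #    accept_multi_match  => 'ANY_TOP_LAST',
        },
    'Step7' =>
        {
        action                  => 'CLICK_DOWNLOAD_OK',
        },
"""

print(convert_macro(SAMPLE))
```

To use it on a real file you would pass in the text between [macro] and [data] and write the returned lines back between the opening "(" and closing ");" of %::WebSurfRules. Because the fields end up in a dictionary, fields such as pass, fail and matchtype are simply ignored rather than stripped out one regex at a time.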
Marwan Alsabbagh
  • Thanks! The rules are fine. I didn't know that regular expressions could be used so similarly in both Python and Perl. –  Nov 23 '12 at 17:53
  • Yeah, it's one of the greatest strengths of Perl; regular expressions work great in Python too – Marwan Alsabbagh Nov 23 '12 at 18:05