I have looked everywhere online for good parsing code, but all the examples are very trivial. The following Perl regular expression works only for 5 bytes.

    rx1=prxparse("s/<.*?>//");

My table contains a text field with strings like this:

    <meta name="generator" content="HTML Tidy, see www.w3.org" /> Test
    <table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;" border="0"
    cellspacing="0" cellpadding="0" width="480"> <tr style="HEIGHT: 15.0pt;">
    <td style="BORDER-BOTTOM: rgb(236,233,216);BORDER-LEFT: rgb(236,233,216);
    BACKGROUND-COLOR: transparent;WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">

So it contains `<table>`, `<tr>`, `<td>`, and other complex HTML structures. How can I parse this kind of HTML into plain text?

Buras
  • Not possible in a general sense, see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1733489#1733489 for example. HTML is too complicated and irregular for any sort of regular expression. You can do it in a limited sense, as in parse a particular page or structure, but it's practically case by case, and not something that is appropriate to a question here in this general of a form. – Joe Jun 25 '14 at 20:36
  • But I have done this with Java, C++, etc. There are many HTML parser APIs. I thought that SAS might also have some `sas-html-parser.sas` user-defined files that I could `%include` and use like those APIs – Buras Jun 25 '14 at 21:32
  • There isn't really a single one in SAS, no. There is `proc http` which is useful for fetching data from forms and such, and there are some more complex tools in some of the BI suites, but in base SAS there isn't a direct HTML parser. There are direct XML parsers (libname method, using an xml map), if you have that. – Joe Jun 25 '14 at 21:36
  • Thank you. It is good to know. I should stop searching then – Buras Jun 25 '14 at 21:53
  • It's worth noting that if you are working with XML files you can use SAS XML mapper (which is fine unless you have thousands of files to deal with in which case I found it too slow). You can download it here: http://support.sas.com/kb/33/584.html – Robert Penridge Jun 26 '14 at 17:30
  • One interesting thing you can do is use `proc groovy`, if you're familiar with Groovy. That has some HTML parsers written for it that are pretty powerful, I believe. – Joe Jun 26 '14 at 20:11
  • I did not know about it. Is it only for SAS 9.3? I have 9.2, can I still use it? – Buras Jun 26 '14 at 22:25

1 Answer

What @Joe says in the comments is true, but unfortunately it doesn't negate the fact that people often need to solve this kind of problem. Below are some macros that I use when I need to extract certain values out of XML/HTML. They're not perfect, but they've gotten the job done for everything I've needed.

The major limitation of these macros is that the HTML/XML being parsed must exist in a single field in SAS. A single character field in SAS is limited to 32,767 characters, so if your HTML file is bigger than that you will need to work with just the subset of it that you need.
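
As a rough sketch of how a file could be loaded into a single field for these macros (this is not part of the original macros; the path `page.html` and the names `html_blob`, `is_full` and `first_td` are hypothetical), the lines can be appended until the 32,767 limit would be exceeded:

```sas
/* Sketch only: "page.html" is a hypothetical path; adjust to your own file. */
data html_blob;
  length xml $32767;
  retain xml "" is_full 0;
  infile "page.html" lrecl=32767 truncover end=eof;
  input line $char32767.;
  /* Append lines (separated by a space) only while the result still fits. */
  if not is_full then do;
    if lengthn(xml) + lengthn(strip(line)) + 1 <= 32767 then xml = catx(" ", xml, line);
    else is_full = 1;  /* stop appending; the remainder of the file is ignored */
  end;
  if eof then output;
  keep xml;
run;

/* The single field can then be passed to the macro. */
data parsed;
  set html_blob;
  %prxExtract(iElement=td, iField=first_td, iType=c, iLength=200, iXMLField=xml);
run;
```

Anything past the limit is silently dropped in this sketch, so for larger files you would want to select the relevant portion first.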

Examples are included and the best way to figure out how it works is just to run the examples.

/*****************************************************************************
**  PROGRAM: PRXEXTRACT.SAS
**
**  SEARCHES THROUGH AN XML (OR HTML) FILE FOR AN ELEMENT AND EXTRACTS THE 
**  VALUE BETWEEN AN ELEMENTS TAGS.
**  
**  PARAMETERS:
**  iElement      : The element to search through the blob for.
**  iField        : The fieldname to save the result to.
**  iType         : (N or C) for Numeric or Character.
**  iLength       : The length of the field to create.  
**  iXMLField     : The name of the field that contains the XML blob to parse.
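**  iDelimiterType: (1/2/3). Defaults to 1.  1 USES <> AS DELIMS. 2 USES [].
**                  3 USES ENCODED VALUES FOR <>.
**  iSequence     : Which occurrence of the element to extract. Defaults to 1.
**  iAttributesField: (Optional) The fieldname to save the element's attributes to.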
**
******************************************************************************
**  HISTORY:
**  1.0 MODIFIED: 14-FEB-2011  BY:RP
**  - CREATED. 
**  1.1 MODIFIED: 16-FEB-2011  BY:RP
**  - ADDED OPTION TO CHANGE DELIMITERS FROM <> TO []
**  1.1 MODIFIED: 17-FEB-2011  BY:RP
**  - CORRECTED ERROR WHEN MATCH RETURNS A LENGTH OF ZERO
**  - CORRECTED MISSING AMPERSAND FROM IDELIMITERTYPE CHECK.
**  - ADDED ESCAPING QUOTES TO [] DELIMITER TYPE
**  - CORRECTED WARNING WHEN MATCH RETURNS MISSING NUMERIC FIELD
**  1.2 MODIFIED: 25-FEB-2011  BY:RP
**  - ADDED DELIMITER TYPES TO WORK WITH MASKED HTML CODES
**  1.3 MODIFIED: 11-MAR-2011  BY:RP
**  - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
**  1.4 MODIFIED: 14-MAR-2011  BY:RP
**  - CORRECTED TO REMOVE FALSE MATCHES FROM PRIOR VERSION. ADDED EXAMPLE.
**  1.5 MODIFIED: 10-APR-2012  BY:RP
**  - CORRECTED PROBLEM WITH ZERO LENGTH STRING MATCHES
**  1.6 MODIFIED: 22-MAY-2012  BY:RP
**  - ADDED ABILITY TO CAPTURE ATTRIBUTES
*****************************************************************************/

%macro prxExtract(iElement=, iField=, iType=, iLength=, iXMLField=, iDelimiterType=1, iSequence=1, iAttributesField=);

  %local delim_open delim_close;

  crLf = byte(10) || byte(13);
  &iXMLField = compress(&iXMLField,crLf,);

  %if &iDelimiterType eq 1 %then %do;
    %let delim_open  = <;
    %let delim_close = >;
  %end;
  %else %if &iDelimiterType eq 2 %then %do;
    %let delim_open  = \[;
    %let delim_close = \];
  %end;
  %else %if &iDelimiterType eq 3 %then %do;
    %let delim_open  = %nrbquote(&)lt%quote(%str(;)) ;
    %let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
  %end;
  %else %do;
    %put ERR%str()ROR (prxExtract.sas): You specified an incorrect option for the iDelimiterType parameter.;
  %end;

  %if %sysfunc(index(&iField,[)) %then %do;
    /* DONT DO THIS IF ITS AN ARRAY */
  %end;
  %else %do;
    %if "%upcase(&iType)" eq "N" %then %do;
      attrib &iField length=&iLength format=best.;
    %end;
    %else %do;
      attrib &iField length=$&iLength format=$&iLength..;
    %end;
  %end;

  /*
  ** BREAKDOWN OF REGULAR EXPRESSION (EXAMPLE USES < AND > AS DELIMS AND ANI AS THE ELEMENT BEING LOOKED FOR):
  **
  ** &delim_open&iElement                            -->  FINDS <ANI
  ** ((\s+.*?)&delim_close|&delim_close){1}?         -->  FINDS A SINGLE INSTANCE OF EITHER:
  **                                                      - ONE OR MORE WHITESPACE CHARACTERS FOLLOWED BY ANYTHING UNTIL A > CHARACTER
  **                                                      - OR JUST A > CHARACTER
  **                                                      THE OUTER ( AND ) ARE CAPTURE BUFFER 1. THE INNER (\s+.*?) CAPTURES
  **                                                      THE ATTRIBUTES INTO BUFFER 2.
  ** (.*?)                                           -->  FINDS WHAT WE ARE SEARCHING FOR AND CAPTURES IT INTO BUFFER 3.
  ** &delim_open                                     -->  FINDS <
  ** \/                                              -->  FINDS THE / CHARACTER. THE BACKSLASH ESCAPES IT SO IT IS NOT TREATED AS THE REGEX DELIMITER
  ** &iElement&delim_close                           -->  FINDS ANI>
  */
  prx_id = prxparse("/&delim_open&iElement((\s+.*?)&delim_close|&delim_close){1}?(.*?)&delim_open\/&iElement&delim_close/i"); 

  prx_start = 1;
  prx_stop = length(&iXMLField);
  prx_sequence = 0;
  call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  do while (prx_pos > 0);
    prx_sequence = prx_sequence + 1;
    if prx_sequence = &iSequence then do;
      if prx_length > 0 then do;

        call prxposn(prx_id, 3, prx_pos, prx_length);
        %if "%upcase(&iType)" eq "N" %then %do;
          length prx_tmp_n $200;
          prx_tmp_n = substr(&iXMLField, prx_pos, prx_length);
          if cats(prx_tmp_n) ne "" then do;
            &iField = input(substr(&iXMLField, prx_pos, prx_length), ?best.);
          end;
        %end;
        %else %do;          
          if prx_length ne 0 then do;
            &iField = substr(&iXMLField, prx_pos, prx_length);
          end;
          else do;
            &iField = "";
          end;
        %end;

        **
        ** ALSO SAVE THE ATTRIBUTES TO A FIELD IF REQUESTED
        *;
        %if "%upcase(&iAttributesField)" ne "" %then %do;
          call prxposn(prx_id, 2, prx_pos, prx_length);
          if prx_length ne 0 then do;
            &iAttributesField = substr(&iXMLField, prx_pos, prx_length);
          end;
          else do;
            &iAttributesField = "";
          end;
        %end;

      end;
    end;
    call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  end;

  drop crLf prx:;

%mend;

Example for a single element:

data example;

  xml = "<test><ANI2Digits>00</ANI2Digits><XNI xniattrib=1>7606256091</XNI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;

  %prxExtract(iElement=xni, iField=my_xni, iType=c, iLength=15, iXMLField=xml, iSequence=1, iAttributesField=my_xni_attribs);

run;

Example for repeating elements:

data example;

  xml = "<test><ANI2Digits>00</ANI2Digits><ANI>7606256091</ANI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;

  %prxExtract(iElement=ani2digits, iField=ani2digits, iType=c, iLength=50, iXMLField=xml);

  length ani1-ani6 $15;
  length attr1-attr6 $100;
  array arrani [1:6] $ ani1-ani6;
  array arrattr [1:6] $ attr1-attr6;
  %prxCount  (iElement=ani, iXMLField=xml, iDelimiterType=1);
  do cnt=1 to prx_count;
    %prxExtract(iElement=ani, iField=arrani[cnt], iType=c, iLength=15, iXMLField=xml, iSequence=cnt, iAttributesField=arrattr[cnt]);
  end;

run;

Finally, if you need the version for multiple elements, you will also need the prxCount macro:

/*****************************************************************************
**  PROGRAM: MACROS.PRXCOUNT.SAS
**
**  RETURNS THE NUMBER OF TIMES AN ELEMENT IS FOUND IN AN HTML/XML FILE.
**  
**  PARAMETERS:
**  iElement      : The element to search through the blob for.
**  iXMLField     : The name of the field that contains the XML blob to parse.
**  iDelimiterType: (1/2/3). Defaults to 1.  1 USES <> AS DELIMS. 2 USES [].
**                  3 USES ENCODED VALUES FOR <>.   
**
******************************************************************************
**  HISTORY:
**  1.0 MODIFIED: 25-FEB-2011  BY:RP
**  - CREATED. 
**  1.1 MODIFIED: 14-MAR-2011  BY:RP
**  - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
*****************************************************************************/

%macro prxCount(iElement=, iXMLField=, iDelimiterType=1);

  %local delim_open delim_close;

  crLf = byte(10) || byte(13);
  &iXMLField = compress(&iXMLField,crLf,);

  %if &iDelimiterType eq 1 %then %do;
    %let delim_open  = <;
    %let delim_close = >;
  %end;
  %else %if &iDelimiterType eq 2 %then %do;
    %let delim_open  = \[;
    %let delim_close = \];
  %end;
  %else %if &iDelimiterType eq 3 %then %do;
    %let delim_open  = %nrbquote(&)lt%quote(%str(;)) ;
    %let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
  %end;
  %else %do;
    %put ERR%str()ROR (prxCount.sas): You specified an incorrect option for the iDelimiterType parameter.;
  %end;

  prx_id = prxparse("/&delim_open&iElement(\s+.*?&delim_close|&delim_close){1}?(.*?)&delim_open\/&iElement&delim_close/i"); 

  prx_count = 0;
  prx_start = 1;
  prx_stop  = length(&iXMLField);
  call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  do while (prx_pos > 0);
    prx_count = prx_count + 1;
    call prxposn(prx_id, 1, prx_pos, prx_length);
    call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  end;

  drop crLf prx_:;

%mend;
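
If the goal is only to reduce the HTML to plain text, as in the question, rather than to pull out particular elements, a much simpler approach may be enough. The following is a minimal sketch, not part of the macros above: it uses `prxchange` with `-1` (replace all matches) to delete anything that looks like a tag, then tidies the whitespace. The dataset and variable names are just for illustration. It assumes that no `>` appears inside an attribute value, and it does not handle comments, `<script>` blocks, or entities such as `&amp;`.

```sas
data plain_text;
  length html plain $32767;
  html = '<meta name="generator" content="HTML Tidy, see www.w3.org" /> Test <table border="0"> <tr style="HEIGHT: 15.0pt;"> <td width="480">';
  /* Remove every <...> run, then collapse repeated blanks and trim. */
  plain = strip(compbl(prxchange('s/<[^>]*>//', -1, html)));
  put plain=;  /* writes plain=Test to the log */
run;
```

Declaring an explicit `length` for the result variable avoids it being truncated to a shorter default length.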
Robert Penridge
  • This is awesome. Thank you for sharing your code with everybody. I think I am not the only one who will benefit from it – Buras Jun 26 '14 at 18:44
  • I tried a simple test but it failed `data example; xml = "Test"; %prxExtract(iElement=xni, iField=my_xni, iType=c, iLength=15, iXMLField=xml, iSequence=1, iAttributesField=my_xni_attribs); run;` – Buras Jun 26 '14 at 19:15
  • @buras For the `iElement` parameter you should pass `test`, as the element in your example is named `<test>`. Also, the `iAttributesField` field will only be populated if your element contains information like this: ``. – Robert Penridge Jun 27 '14 at 14:26