2

The purpose of this is to extract list of field names from the XML (of Adobe LiveCycle Designer). So, I created the fields in designer, then, I copy the XML of the related fields, paste in Notepad++, and then executer find/replace (ctrl-h) to get only the field names, one field in each line.

This will make it then easier to write the SQL statements to add such fields to the DB to register them.

The XML looks like the following:

<field xmlns="http://www.xfa.org/schema/xfa-template/2.8/" y="0in" x="0.343mm" w="8.881pt" h="9.108pt" name="detcon_recreation_only">
   <ui>
      <checkButton size="8.881pt">
         <border>
            <edge stroke="lowered"/>
            <fill/>
         </border>
      </checkButton>
   </ui>
   <font size="0pt" typeface="Adobe Pi Std"/>
   <para vAlign="middle"/>
   <value>
      <text>0</text>
   </value>
   <items>
      <text>1</text>
      <text>0</text>
      <text/>
   </items>
</field>
<field xmlns="http://www.xfa.org/schema/xfa-template/2.8/" name="detcon_special_housing" y="5.393mm" w="27.94mm" h="4.134mm" x="0.343mm">
   <ui>
      <choiceList>
         <border>
            <edge stroke="lowered"/>
         </border>
         <margin/>
      </choiceList>
   </ui>
   <font typeface="Arial Narrow" size="6pt"/>
   <margin topInset="0mm" bottomInset="0mm" leftInset="0mm" rightInset="0mm"/>
   <para vAlign="middle"/>
   <value>
      <text>NA</text>
   </value>
   <items>
      <text>Not Applicable</text>
      <text>Hotel Component</text>
   </items>
   <items save="1" presence="hidden">
      <text>NA</text>
      <text>HC</text>
   </items>
</field>
<exclGroup xmlns="http://www.xfa.org/schema/xfa-template/2.8/" name="detcon_photo_taken" x="0in" y="0in">
   <?templateDesigner itemValuesSpecified 1?>
   <field w="12.446mm" h="3.825mm" name="lb_yes">
      <ui>
         <checkButton size="1.7639mm" shape="round">
            <border>
               <?templateDesigner StyleID apcb1?>
               <edge/>
               <fill/>
            </border>
         </checkButton>
      </ui>
      <font typeface="Myriad Pro"/>
      <margin leftInset="1mm" rightInset="1mm"/>
      <para vAlign="middle"/>
      <caption placement="right" reserve="7.698mm">
         <para vAlign="middle" spaceAbove="0pt" spaceBelow="0pt" textIndent="0pt" marginLeft="0pt" marginRight="0pt"/>
         <font size="8pt" typeface="Arial Narrow" baselineShift="0pt"/>
         <value>
            <text>YES</text>
         </value>
      </caption>
      <value>
         <text xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
      </value>
      <items>
         <text>1</text>
      </items>
   </field>
   <field w="28.702mm" h="3.825mm" name="lb_no" x="13.233mm">
      <ui>
         <checkButton size="1.7639mm" shape="round">
            <border>
               <?templateDesigner StyleID apcb1?>
               <edge/>
               <fill/>
            </border>
         </checkButton>
      </ui>
      <font typeface="Myriad Pro"/>
      <margin leftInset="1mm" rightInset="1mm"/>
      <para vAlign="middle"/>
      <caption placement="right" reserve="23.954mm">
         <para vAlign="middle" spaceAbove="0pt" spaceBelow="0pt" textIndent="0pt" marginLeft="0pt" marginRight="0pt"/>
         <font size="8pt" typeface="Arial Narrow" baselineShift="0pt"/>
         <value>
            <text>NO (see comments)</text>
         </value>
      </caption>
      <value>
         <text xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
      </value>
      <items>
         <text>0</text>
      </items>
   </field>
   <border>
      <edge presence="hidden"/>
   </border>
   <?templateDesigner expand 1?></exclGroup>

So I figured out the following RegEx to perform find/replace to get only the field names, one field on each line.

Find to get Field Name: (?i)<(field|exclGroup).*name="([a-z_]\w*)".*$
Replace: $2

Another find/replace...
Remove all other lines: ^.*<(?!.*name=).*.*[\r\n]*
Replace with blank

If you execute the above two find/replace sessions, you will end up with list of field names one field per line.

What I wanted to do is to perform the above in one find/replace session, and then convert the above into SQL Statements using also find/replace, using this template:

INSERT INTO table_name (element_id, element_name, element_type, default_value, required, clone) 
VALUES (12345,"field_name_goes_here","/Tx", "", "N", "Y"),
VALUES (12346,"field_name_goes_here","/Tx", "", "N", "Y"),
VALUES (12347,"field_name_goes_here","/Tx", "", "N", "Y"),
VALUES (12348,"field_name_goes_here","/Tx", "", "N", "Y"),
VALUES (12349,"field_name_goes_here","/Tx", "", "N", "Y"),

The element_id field is sequential, but don't worry about that, I can take care of this in Excel.

Appreciate your help, Tarek

tarekahf
  • 738
  • 1
  • 16
  • 42
  • Which RDBMS are you using? Most of them have native support to read XMLs... – Shnugo Feb 13 '18 at 19:10
  • @Shnugo I understand from where you are coming. This won't work. The XML I am referring to is specific to Adobe XFA architecture, and has nothing to do with my requirements. I simply want to extract the field names, and generate the INSERT SQL to register them in our DB which is specific to our requirements. – tarekahf Feb 14 '18 at 15:35

2 Answers2

2

Scraper Series

One small help. The element is slightly different than the which is making things a little more complicated. Your RegEx is so sophisticated, I couldn't modify it to include the element. I think I need more time to digest it. So could you modify it to include exclGroup and only extract name only without extracting the inner field elements of the exclGroup?

Ok, here you go.
It does make it a little more complicated.

I have 2 versions to do this. One that uses recursion, one that doesn't.

I'm posting the version that uses recursion.
If you need the non-recursion, let me know and I'll post that.

Find (?:(?!<(?:field|exclGroup)(?!\w)(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+>)[\S\s])*(?><(field|exclGroup)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(?:(['"])([\S\s]*?)\2))\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+>)(?:(?&core)|)</\1\s*>(?:(?!<(?:field|exclGroup)(?!\w)(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+>)[\S\s])*(?(DEFINE)(?<core>(?>(?><([\w:]+)(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+>)(?:(?&core)|)</\5\s*>|(?!</[\w:]+\s*>)(?>[\S\s]))+))

Replace VALUES (12345,"$3","/Tx", "", "N", "Y"),\r\n

https://regex101.com/r/icnF3i/1

Formatted (incase you need to look at it)

 (?:                           # Prefix - Optional any chars that don't start a field or exclGroup tag
      (?!
           < 
           (?: field | exclGroup )
           (?! \w )
           (?>
                " [\S\s]*? "
             |  ' [\S\s]*? '
             |  (?:
                     (?! /> )
                     [^>] 
                )?
           )+
           >
      )
      [\S\s] 
 )*

 (?>                           # open 'field' or 'exclGroup' tag ------------------
      < 
      ( field | exclGroup )         # (1)
      (?=                           # Asserttion (a pseudo atomic group)
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s name \s* = \s* 
           (?:
                ( ['"] )                      # (2), Quote
                ( [\S\s]*? )                  # (3), Name value - only thing we want
                \2 
           )
      )
      \s+ 
      (?>
           " [\S\s]*? "
        |  ' [\S\s]*? '
        |  (?:
                (?! /> )
                [^>] 
           )?
      )+
      >
 )
 (?:
      (?&core)                      # Call the core recursion function (balanced tags)
   |  
 )
 </ \1 \s* >                   # Close 'field' or 'exclGroup' tag ------------------

 (?:                           # Postfix - Optional any chars that don't start a field or exclGroup tag
      (?!
           < 
           (?: field | exclGroup )
           (?! \w )
           (?>
                " [\S\s]*? "
             |  ' [\S\s]*? '
             |  (?:
                     (?! /> )
                     [^>] 
                )?
           )+
           >
      )
      [\S\s] 
 )*

 # ---------------------------------------------------------

 (?(DEFINE)
      (?<core>                      # (4 start), Inner balanced tags
           (?>
                (?>
                     < 
                     ( [\w:]+ )                    # (5), Any open tag
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )?
                     )+
                     >
                )
                (?:                           # Recurse core 
                     (?&core) 
                  |  
                )
                </ \5 \s* >                   # Balanced close tag (I can see you 5)
             |  
                (?! </ [\w:]+ \s* > )         # Any char not starting a close tag (passive)
                (?> [\S\s] )
           )+

      )                             # (4 end)
 )

You can view the non-recursive version here https://regex101.com/r/ztOrP5/1

  • Wow ... thank you ...! One small help. The element `` is slightly different than the `` which is making things a little more complicated. Your RegEx is so sophisticated, I couldn't modify it to include the `` element. I think I need more time to digest it. So could you modify it to include `exclGroup` and only extract name only without extracting the inner field elements of the exclGroup? – tarekahf Feb 14 '18 at 15:43
  • OK, I need probably a few days to understand this. For now, I completed understanding of Atomic Groups! – tarekahf Feb 14 '18 at 22:31
  • After analyzing your RegEx, I got better understanding of how they work, and I was able to find a simplified version. Could you please check my answer and let me know what you think? – tarekahf Feb 14 '18 at 23:36
1

I am trying to simplify the RegEx provided in the previous answer.

This is my simplified version:

RegEx: (?|(?><field.*name\s*=\s*"([a-z_]\w*)"(?:.|\n)*?(?:<\/field>))|(?:<exclGroup.*name\s*=\s*"([a-z_]\w*)"(?:.|\n)*?(?:<\/exclGroup>)))

Replace: $1

Check it out over here: https://regex101.com/r/icnF3i/3

Appreciate your feedback.

Thanks to sln for helping me to reach this level.


EDIT:
The above RegEx doesn't work in Notepad++.

To use the same under Notepad++ use the following find/replace combination:

Find: (?i)<(field|exclGroup).*name\s*=\s*"([a-z_]\w*)"[\s\S]*?<\/\1>
Replace: \(12345,"$2","/Tx", "", "N", "Y"\),

tarekahf
  • 738
  • 1
  • 16
  • 42
  • Works for your sample. As long as you don't have nested identical tags that should work. Note that you don't need the atomic group `(?>` you could change it to `(?:` without any adverse affects. –  Feb 15 '18 at 16:40
  • Yes, I fully understand what you mean. However, it was very difficult for me to understand the logic of the previous solution. One small thing, the RegEx in this solution works on regex101.com but it doesn't work in Notepad++. I will play around to see how I can fix this, in the meantime, I would appreciate your feedback. – tarekahf Feb 15 '18 at 16:51
  • Well, I would modify your regex to this [Expanded](https://regex101.com/r/SEM6Ri/1) : use only the _expanded_ option in Notepad. Or, same but compressed [Compressed](https://regex101.com/r/SEM6Ri/2) : _don't use any option in Notepad._ Note that if this is failing based on the different xml you're using, I would fallback to the recursive regex. –  Feb 15 '18 at 17:14
  • No it worked in Notepad++. I think it was because of the new line character thing. Check my updated answer. – tarekahf Feb 15 '18 at 17:53