Extracting values using regular expression

Question

I have a long string extracted from an XML file, which I've pasted into Notepad++. Below is a snippet of that XML. It holds 3 rows of data that I need to clean up. What I want is a dataset whose final output looks like this:

SAPClient 1 RCLNT
Ledger 2 RLDNR
CompanyCode 3 RBUKRS

As you can probably figure out, I need the values of the id and order attributes of each attribute element, plus the columnName attribute of its keyMapping child. I wasn't sure how to do this, but thought I could use the regular expression feature in Notepad++: group the fields I need and replace with \1 \2 \3. I am not very good at this yet, though. The regex below is where I am right now. It selects the first line of each attribute element, but I can't get any further and am running out of ideas. Please help.

REGEX: (<attribute\sid="[a-zA-Z0-9]+").+

XML:

`<attribute id="SAPClient" order="1" attributeHierarchyActive="false" displayAttribute="false">
        <descriptions defaultDescription="SAP Client"/>
        <searchProperties/>
        <keyMapping columnObjectName="Join_2" columnName="RCLNT"/>
      </attribute>
      <attribute id="Ledger" order="2" attributeHierarchyActive="false" displayAttribute="false">
        <descriptions defaultDescription="Ledger"/>
        <searchProperties/>
        <keyMapping columnObjectName="Join_2" columnName="RLDNR"/>
      </attribute>
      <attribute id="CompanyCode" order="3" attributeHierarchyActive="false" displayAttribute="false">
        <descriptions defaultDescription="Company Code"/>
        <searchProperties/>
        <keyMapping columnObjectName="Join_2" columnName="RBUKRS"/>
      </attribute>`

I suggest using the "Evaluate XPath Expression" feature of the XML Tools plugin instead of relying on regex. Here's how you would access the `id` attribute of the `attribute` tags with XPath: `//attribute/@id` – Aaron, May 21 '17 at 13:03
I don't know anything about the plugin Aaron mentioned. But when people ask questions about regexes on Stack Overflow, the best answer is very often "don't use a regex, use this library/module/plugin instead". That way you don't have to worry about unusual characters, changes in the order of attributes in an element, and other quirks. – David Knipe, May 21 '17 at 16:57
I understand that this might not be the best way to do it, but I was not able to follow the plugin method in this case. – Analytics1988, May 21 '17 at 17:37
@shyamUthaman the solution I mentioned was to use [XPath](https://en.wikipedia.org/wiki/XPath) to access data from an XML document, as it is a language devised specifically for that. Regex, on the other hand, has no understanding of the XML format, and crafting one that handles all edge cases would be a terrible exercise. If you just have one use case with "nice" data it shouldn't be a problem, but if you often have to extract data from XML you should definitely pick up XPath. If you often have to transform XML data, you'll want to use XQuery or the older XSLT, both of which rely on XPath. – Aaron, May 22 '17 at 15:43
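
For readers who want to try the XPath route from the comments outside Notepad++, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. It is not the XML Tools plugin workflow Aaron describes. The `<root>` wrapper and the `extract_rows` helper are assumptions added for illustration: the snippet in the question has no single root element, so it has to be wrapped before it can be parsed.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question.
SNIPPET = """
<attribute id="SAPClient" order="1" attributeHierarchyActive="false" displayAttribute="false">
  <descriptions defaultDescription="SAP Client"/>
  <searchProperties/>
  <keyMapping columnObjectName="Join_2" columnName="RCLNT"/>
</attribute>
<attribute id="Ledger" order="2" attributeHierarchyActive="false" displayAttribute="false">
  <descriptions defaultDescription="Ledger"/>
  <searchProperties/>
  <keyMapping columnObjectName="Join_2" columnName="RLDNR"/>
</attribute>
<attribute id="CompanyCode" order="3" attributeHierarchyActive="false" displayAttribute="false">
  <descriptions defaultDescription="Company Code"/>
  <searchProperties/>
  <keyMapping columnObjectName="Join_2" columnName="RBUKRS"/>
</attribute>
"""


def extract_rows(xml_fragment):
    """Return (id, order, columnName) tuples for every <attribute> element.

    Hypothetical helper: wraps the fragment in a made-up <root> element so
    that the parser sees a well-formed document.
    """
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    rows = []
    for attr in root.iterfind(".//attribute"):
        key_mapping = attr.find("keyMapping")
        # Fall back to an empty string if an element has no <keyMapping> child.
        column = key_mapping.get("columnName", "") if key_mapping is not None else ""
        rows.append((attr.get("id", ""), attr.get("order", ""), column))
    return rows


rows = extract_rows(SNIPPET)
for row in rows:
    print("\t".join(row))
```

Because the parser reads attributes by name, this keeps working if the attributes appear in a different order or if extra attributes are added, which is the robustness argument made in the comments.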

score 0 · Answer 1 · answered May 21 '17 at 17:22

In Notepad++:

Find what: .*?<attribute id="(\w+)" order="(\d+)".*?keyMapping.*?columnName="(\w+)".*?<\/attribute>
Replace with: $1\t$2\t$3\n
Search mode: Regular expression, with ". matches newline" checked.
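
The same find-and-replace can be checked outside Notepad++. The sketch below is an assumption-laden translation to Python's `re` module, not part of the Notepad++ workflow: `re.DOTALL` stands in for the ". matches newline" checkbox, `\1 \2 \3` replace `$1 $2 $3`, and the slash in the closing tag is left unescaped because Python does not need it escaped. The variable names are made up for illustration.

```python
import re

# Sample data copied from the question.
XML_TEXT = """
<attribute id="SAPClient" order="1" attributeHierarchyActive="false" displayAttribute="false">
  <descriptions defaultDescription="SAP Client"/>
  <searchProperties/>
  <keyMapping columnObjectName="Join_2" columnName="RCLNT"/>
</attribute>
<attribute id="Ledger" order="2" attributeHierarchyActive="false" displayAttribute="false">
  <descriptions defaultDescription="Ledger"/>
  <searchProperties/>
  <keyMapping columnObjectName="Join_2" columnName="RLDNR"/>
</attribute>
<attribute id="CompanyCode" order="3" attributeHierarchyActive="false" displayAttribute="false">
  <descriptions defaultDescription="Company Code"/>
  <searchProperties/>
  <keyMapping columnObjectName="Join_2" columnName="RBUKRS"/>
</attribute>
"""

# Same three capture groups as the "Find what" pattern above.
PATTERN = re.compile(
    r'.*?<attribute id="(\w+)" order="(\d+)"'
    r'.*?keyMapping.*?columnName="(\w+)"'
    r'.*?</attribute>',
    re.DOTALL,  # lets "." match newlines, like the Notepad++ checkbox
)

# Each matched <attribute> block collapses to "id<TAB>order<TAB>columnName".
# strip() drops the newline left over after the last replacement.
table = PATTERN.sub(r"\1\t\2\t\3\n", XML_TEXT).strip()
print(table)
```

The pattern depends on `id` and `order` being the first two attributes, in that order, and on every `attribute` element containing a `keyMapping` child. If one element lacked it, the lazy `.*?` would run on into the next element and merge two rows.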

LukStorms


Thanks! You are awesome. – Analytics1988 May 21 '17 at 17:37
@shyamUthaman Yeah, well, people here often don't like to solve HTML/XML questions with regex, because [regex often isn't the best choice for that](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg). – LukStorms May 21 '17 at 17:47
