Excel - Extract substring(s) from string using FILTERXML

Question

_Background

Lately I've been trying to get more familiar with the concept of changing a delimited string into an XML to parse with Excel's FILTERXML and retrieve those substrings that are of interest. Please note that this function came available from Excel 2013 and is not available on Excel for Mac nor Excel Online.

With a delimited string, I meant anything from a normal sentence using spaces as delimiters or any other combination of characters that could be used to define substrings within a string. For example let's imagine the following:

ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123

_Question

So, where a lot of people know how to get the nth element (e.g.: =TRIM(MID(SUBSTITUTE(A1,"|",REPT(" ",LEN(A1))),3*LEN(A1)+1,LEN(A1))) to retrieve 456). Or other combinationes with LEN(), MID(), FIND() and all those constructs, how do we use FILTERXML to use more specific criteria to extract substrings of concern and clean up the full string? For example, how to retrieve:

elements by position
numeric or non-numeric elements
elements that contain a substring on their own
elements that start or end with a substring
elements that are upper- or lowercase
elements holding numbers
unique values
...

JvdV · Accepted Answer · 2023-02-14T15:09:28.793

Excel's FILTERXML uses XPATH 1.0 which unfortunately means it is not as diverse as we would maybe want it to be. Also, Excel seems to not allow returning reworked node values and exclusively allows you to select nodes in order of appearance. However there is a fair share of functions we can still utilize. More information about that can be found here.

The function takes two parameters: =FILTERXML(<A string in valid XML format>,<A string in valid XPATH format>)

Let's say cell A1 holds the string: ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123. To create a valid XML string we use SUBSTITUTE to change the delimiter to valid end- and start-tag constructs. So to get a valid XML construct for the given example we could do:

"<t><s>"&SUBSTITUTE(A1,"|","</s><s>")&"</s></t>"

For readability reasons I'll refer to the above construct with the word <XML> as a placeholder. Below you'll find different usefull XPATH functions in a valid construct to filter nodes:

_{1) All Elements:}

=FILTERXML(<XML>,"//s")

_{Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (all nodes)}

_{2) Elements by position:}

=FILTERXML(<XML>,"//s[position()=4]")

Or:

=FILTERXML(<XML>,"//s[4]")

_{Returns: 456 (node on index 4)}^†

=FILTERXML(<XML>,"//s[position()<4]")

_{Returns: ABC, 123 and DEF (nodes on index < 4)}

=FILTERXML(<XML>,"//s[position()=2 or position()>5]")

_{Returns: 123, ZY-2F, XY-3F, XY-4f, xyz and 123 (nodes on index 2 or > 5)}

=FILTERXML(<XML>,"//s[last()]")

_{Returns: 123 (node on last index)}

=FILTERXML(<XML>,"//s[position() mod 2 = 1]")

_{Returns: ABC, DEF, XY-1A, XY-3F and xyz (odd nodes)}

=FILTERXML(<XML>,"//s[position() mod 2 = 0]")

_{Returns: 123, 456, ZF-2F, XY-4f and 123 (even nodes)}

_{3) (Non) numeric elements:}

=FILTERXML(<XML>,"//s[number()=.]")

Or:

=FILTERXML(<XML>,"//s[.*0=0]")

_{Returns: 123, 456, and 123 (numeric nodes)}

=FILTERXML(<XML>,"//s[not(number()=.)]")

Or:

=FILTERXML(<XML>,"//s[.*0!=0)]")

_{Returns: ABC, DEF, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (non-numeric nodes)}

_{4) Elements that (not) contain:}

=FILTERXML(<XML>,"//s[contains(., 'Y')]")

_{Returns: XY-1A, ZY-2F, XY-3F and XY-4f (containing 'Y', notice XPATH is case sensitive, exclusing xyz)}

=FILTERXML(<XML>,"//s[not(contains(., 'Y'))]")

_{Returns: ABC, 123, DEF, 456, xyz and 123 (not containing 'Y', notice XPATH is case sensitive, including xyz)}

_{5) Elements that (not) start or/and end with:}

=FILTERXML(<XML>,"//s[starts-with(., 'XY')]")

_{Returns: XY-1A, XY-3F and XY-4f (starting with 'XY')}

=FILTERXML(<XML>,"//s[not(starts-with(., 'XY'))]")

_{Returns: ABC, 123, DEF, 456, ZY-2F, xyz and 123 (don't start with 'XY')}

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F']")

_{Returns: DEF, ZY-2F and XY-3F (end with 'F', notice XPATH 1.0 does not support ends-with)}

=FILTERXML(<XML>,"//s[not(substring(., string-length(.) - string-length('F') +1) = 'F')]")

_{Returns: ABC, 123, 456, XY-1A, XY-4f, xyz and 123 (don't end with 'F')}

=FILTERXML(<XML>,"//s[starts-with(., 'X') and substring(., string-length(.) - string-length('A') +1) = 'A']")

_{Returns: XY-1A (start with 'X' and end with 'A')}

_{6) Elements that are upper- or lowercase:}

=FILTERXML(<XML>,"//s[translate(.,'abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')=.]")

_{Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F and 123 (uppercase nodes)}

=FILTERXML(<XML>,"//s[translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')=.]")

_{Returns: 123, 456, xyz and 123 (lowercase nodes)}

_{NOTE: Unfortunately XPATH 1.0 does not support upper-case() nor lower-case() so the above is a workaround. Add special characters if need be.}

_{7) Elements that (not) contain any number:}

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=.]")

_{Returns: 123, 456, XY-1A, ZY-2F, XY-3F, XY-4f and 123 (contain any digit)}

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')=.]")

_{Returns: ABC, DEF and xyz (don't contain any digit)}

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=. and .*0!=0]")

_{Returns: XY-1A, ZY-2F, XY-3F and XY-4f (holding digits but not a a number on it's own)}

_{8) Unique elements or duplicates:}

=FILTERXML(<XML>,"//s[preceding::*=.]")

_{Returns: 123 (duplicate nodes)}

=FILTERXML(<XML>,"//s[not(preceding::*=.)]")

_{Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (unique nodes)}

=FILTERXML(<XML>,"//s[not(following::*=. or preceding::*=.)]")

_{Returns: ABC, DEF, 456, XY-1A, ZY-2F, XY-3F and XY-4f (nodes that have no similar sibling)}

_{9) Elements of certain length:}

=FILTERXML(<XML>,"//s[string-length()=5]")

_{Returns: XY-1A, ZY-2F, XY-3F and XY-4f (5 characters long)}

=FILTERXML(<XML>,"//s[string-length()<4]")

_{Returns: ABC, 123, DEF, 456, xyz and 123 (shorter than 4 characters)}

_{10) Elements based on preceding/following:}

=FILTERXML(<XML>,"//s[preceding::*[1]='456']")

_{Returns: XY-1A (previous node equals '456')}

=FILTERXML(<XML>,"//s[starts-with(preceding::*[1],'XY')]")

_{Returns: ZY-2F, XY-4f, and xyz (previous node starts with 'XY')}

=FILTERXML(<XML>,"//s[following::*[1]='123']")

_{Returns: ABC, and xyz (following node equals '123')}

=FILTERXML(<XML>,"//s[contains(following::*[1],'1')]")

_{Returns: ABC, 456, and xyz (following node contains '1')}

=FILTERXML(<XML>,"//s[preceding::*='ABC' and following::*='XY-3F']")

Or:

=FILTERXML(<XML>,"//s[.='ABC']/following::s[following::s='XY-3F']")

_{Returns: 123, DEF, 456, XY-1A and ZY-2F (everything between 'ABC' and 'XY-3F')}

_{11) Elements based on sub-strings:}

=FILTERXML(<XML>,"//s[substring-after(., '-') = '3F']")

_{Returns: XY-3F (nodes ending with '3F' after hyphen)}

=FILTERXML(<XML>,"//s[contains(substring-after(., '-') , 'F')]")

_{Returns: ZY-2F and XY-3F (nodes containing 'F' after hyphen)}

=FILTERXML(<XML>,"//s[substring-before(., '-') = 'ZY']")

_{Returns: ZY-2F (nodes starting with 'ZY' before hyphen)}

=FILTERXML(<XML>,"//s[contains(substring-before(., '-'), 'Y')]")

_{Returns: XY-1A, ZY-2F, XY-3F and XY-4f (nodes containing 'Y' before hyphen)}

_{12) Elements based on concatenation:}

=FILTERXML(<XML>,"//s[concat(., '|', following::*[1])='ZY-2F|XY-3F']")

_{Returns: ZY-2F (nodes when concatenated with '|' and following sibling equals 'ZY-2F|XY-3F')}

=FILTERXML(<XML>,"//s[contains(concat(., preceding::*[2]), 'FA')]")

_{Returns: DEF (nodes when concatenated with sibling two indices to the left contains 'FA')}

_{13) Empty vs. Non-empty:}

=FILTERXML(<XML>,"//s[count(node())>0]")

Or:

=FILTERXML(<XML>,"//s[node()]")

_{Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (all nodes that are not empty)}

=FILTERXML(<XML>,"//s[count(node())=0]")

Or:

=FILTERXML(<XML>,"//s[not(node())]")

_{Returns: None (all nodes that are empty)}

_{14) Preceding or Following:}

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F'][last()]/following::*")

_{Returns: XY-4f, xyz and 123 (all nodes to the right of the last node that ends with an uppercase 'F')}

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F'][1]/preceding::*")

_{Returns: ABC and 123 (all nodes to the left of the first node that ends with an uppercase 'F')}

_{15) (Preceding or Following) and self:}

=FILTERXML(<XML>,"(//s[.*0!=0][last()]|//s[.*0!=0][last()]/preceding::*)")

_{Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (trim all numeric nodes from the right)}^††

=FILTERXML(<XML>,"(//s[.*0=0][1]|//s[.*0=0][1]/following::*)")

_{Returns: 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (trim all non-numeric nodes from the left)}

_{16) Maximum or Minimum:}

=FILTERXML(<XML>,"(//s[.*0=0][not(.<//s[.*0=0])])[1]")

_{Returns: 456 (The maximum value looking at numeric nodes)}

=FILTERXML(<XML>,"(//s[.*0=0][not(.>//s[.*0=0])])[1]")

_{Returns: 123 (The minimum value looking at numeric nodes)}

_{NOTE: This is the equivalent to returning all numeric nodes as per #3 and post-process the array using Excel's MIN() and MAX() functions.}

Now obviously the above is a demonstration of possibilities with XPATH 1.0 functions and you can get a whole range of combinations of the above and more! I tried to cover most commonly used string functions. If you are missing any please feel free to comment.

Whereas the question is quite broad on itself, I was hoping to give some general direction on how to use FILTERXML for the queries at hand. The formula returns an array of nodes to be used in any other way. A lot of the times I would use it in TEXTJOIN() or INDEX(). But I guess other options would be new DA-functions to spill results.

Be alert that while parsing a string through FILTERXML(), the ampersand character (&) and the left angle bracket (<) must not appear in their literal form. They will respectively need to be substituted with either & or <. Another option would be to use their numeric ISO/IEC 10646 character code being & or < respectively. After parsing, the function will return these characters back to you in their literal form. Needless to say that splitting a string by the semi-colon therefor just became tricky.

^‡_{Each predicate, the structure between the opening and closing square brackets, is a filter of a given nodelist. To write multiple of these structures is in fact anding such predicates.}

^‡‡_{There isn't really an easy, following/preceding siblings and self, construct. Therefor I used the Union operator. This, however, requires multiple expressions to be inside paranthesis. Much like alternations within a capture group if one would think about regular expressions.}

FYI You might be interested in a tricky way to change a digits string into a sorted array of unique digits via `FilterXML`, enriched by some explanations (as well as a link to the above post :-) at [Divide numbers into unique sorted digits](https://stackoverflow.com/questions/63540928/divide-numbers-into-unique-sorted-digits-displayed-in-a-label-on-a-userform) - @JvdV — T.M., Aug 24 '20 at 18:50
When we can filter numeric nodes only by `=FILTERXML(,"//s[.*0=0]")`. Can we apply condition here like I want filter numeric nodes `<200`? — Harun24hr, Jan 27 '21 at 05:49
@Harun24HR, yes you can use [logical operators](https://www.w3schools.com/xml/xpath_operators.asp) like "less then". You can directly apply e.g: `=FILTERXML(,"//s[.<200]")` in the above example which would return both "123"-nodes. — JvdV, Jan 27 '21 at 06:31
@JvdV That works fine! Honestly saying I have learned about `FILTERXML()` from your this post. Grateful to you. — Harun24hr, Jan 27 '21 at 06:40
@JvdV Fyi Whereas top post focusses on the tabular Syntax with a nearly encyclopaedic number of XPath examples, it might be of some interest how to handle **results** in (dynamic) VBA: [example here](https://stackoverflow.com/questions/69637009/getting-3-digits-from-a-set-of-numbers-using-regex-for-vba/69645093#69645093) — T.M., Oct 25 '21 at 10:46
I have seen that post this morning (and +'ed it for you). Used the function through VBA once or twice, though sparsely, myself. @T.M. — JvdV, Oct 25 '21 at 10:50
@JvdV Sir, this is just superb and mind blowing, the day you posted since then I am using it, much helpful for me!!! Gracias!!! — Mayukh Bhattacharya, Feb 02 '22 at 02:18
Hi JvdV, the FILTERXML function is not working for me anymore. Not sure if it has anything to do with the recent Excel update. Could you please let me know if you can still use the function as usual? — Terry W, Apr 12 '22 at 00:19
@TerryW, hello. I find there is nothing wrong on my end. I'm on the very latest update of Excel (365's BETA channel). — JvdV, Apr 12 '22 at 06:30
@JvdV Could you confirm that the `FilterXML` *XPath expressions* needs xmldom functions to be written only lower case - see my recent *Caveat* hint at [Find element in array ...](https://stackoverflow.com/questions/72883948/how-to-find-an-element-in-an-array-through-a-specific-character-in-the-element/73859472#73859472) — T.M., Sep 27 '22 at 19:19
@T.M., yes, keywords in xpath are written in lower-case. You are correct in your statement that functions will error if written differently. — JvdV, Sep 27 '22 at 20:43
Great explanation @JvdV, not I understand it better. Worth noticing that according to [FILTERXML](https://support.microsoft.com/en-us/office/filterxml-function-4df72efc-11ec-4951-86f5-c1374812f5b7) documentation as of 11/15/2022 *is not available in Excel for the web and Excel for Mac* — David Leal, Nov 15 '22 at 15:53

JvdV · Answer 2 · 2023-05-31T18:40:52.690

EDIT 22-3-2022:

As per the new TEXTSPLIT() function, one can now split a string straight into an horizontal or vertical range (or both). It can even split on an array of values. However, it can not directly filter as 'xpath' can. Therefor I'll keep the below answer intact as it can still be relevant.

The infamous missing `XSPLIT()` function

This post is meant as a little in-depth extra to show how we can make our own re-usable XSPLIT() function using FILTERXML() but without the use of VBA. Though currently in BETA, LAMBDA() is comming our way and with this function we can create our own custom function. Let me explain this at the hand of an example:

The formula in C1 is simply =XSPLIT(A1,B1:B3) and it spills the delimited text values in order of appearance. However SPLIT() is the name of our LAMBDA() function we created in the "name manager":

=LAMBDA(txt,del,[xpath],FILTERXML("<t><s>"&TEXTJOIN("</s><s>",,TEXTSPLIT(txt,del))&"</s></t>","//s"&xpath))

As you can see, the function has 4 parameters:

txt - A reference to our source value.
del - Any amount of delimiters we'd like. Written or referenced.
[xpath] - Place for a xpath expression to apply some filter if need be. E.g: "[.*0=0]" to only return numeric substrings. This is an optional parameter
FILTERXML("<t><s>"&TEXTJOIN("</s><s>",,TEXTSPLIT(txt,del))&"</s></t>","//s"&xpath)

The 4th parameter is where all the 3 previous parameters are being called to create the same construct as been covered in the main post. Now, since MS does not want to give us themselves, we created our own XSPLIT() function with the three parameters.

The main post concentrates on the SUBSTITUTE() of a specific delimiter; in the given example the pipe-symbol. But what if you have several delimiters? You'd need multiple nested SUBSTITUTE() functions, right? Wouldn't it be great if we can implement that in our XSPLIT() function too? Here is where TEXTSPLIT() is getting exciting for me personally because we can feed this function multiple delimiters to split a string into elements.

We have now created our own XSPLIT() function with three parameters:

=XSPLIT(<StringToBeSplited>,<YourDelimiters>,<OptionalXpath>)

And we can now use this as a function in our whole workbook. Enjoy!

Making fellow South Africans proud with superb answer - nice one boet (assuming I haven't gotten genders misapproprated :) . I have only dappled in Lambda, and still need to get my head around the reduce function - there's no peak when to Excel knowledge it seems. Anyways - upvote from my end keep it up. — JB-007, Aug 18 '22 at 19:22
Howsit @JB-007. Thanks for the kudos bru. Keep learning and don't shy away from asking if I can help — JvdV, Aug 18 '22 at 21:36

Excel - Extract substring(s) from string using FILTERXML

2 Answers2

The infamous missing `XSPLIT()` function

Linked

Related

Excel - Extract substring(s) from string using FILTERXML

2 Answers2

The infamous missing XSPLIT() function

Linked

Related

The infamous missing `XSPLIT()` function