2

How to remove or replace familiar words in string, by list of familiar stubs of that words?

List contain ~40 stubs (substrings). With all of this I expect substitute hundreds matching words.

I'm interested in formula solution because I already know how to do this, coding VBA.


I play around TEXTJOIN with FILTERXML, but its not possible to use it for big stub list:

Formula 1. FILTERXML (one by one entered stub)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(A2;" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti') or contains(., 'AuTi') or contains(., 'tion') or contains(., 'mpl') or contains(., 'Mpl')or contains(., 'etc'))]"));"")

Result is good, but unfortunately is case sensitive, so one by one entered substrings for all capitalization version (Propercase, UPERCASE, lowercase, MixEdCase) generate a very big formula impossible to use because of length limitation (255 characters).

Formula 2. FILTERXML (with case translation)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<T><S>"&SUBSTITUTE(A2;" ";"</S><S>")&"</S></T>";"//S[not(contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'auti')or contains(., 'tion')or contains(., 'mpl')or contains(., 'etc'))]"));"")

Result is good apparently, but there is some strange missed capitalization matches (check cell D14 in attached image). To avoid this, I can use translation for each stub translate(., 'AUTI', 'auti'),'auti') but length limitation will not leave me to use all list.

Formula 3. FILTERXML (with LOWER-ing all string)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(LOWER(A2);" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti')or contains(., 'tion') or contains(., 'mpl') or contains(., 'etc'))]"));"")

Result is good, but output string is totally lower. And again length limitation is still there.

example image

I know FILTERXML is a good way for this task, but I don't know how to work with. Unfortunately Xpath 1.0 doesn't allow things like matches()

Losai
  • 329
  • 2
  • 9

1 Answers1

2

Very nice question, but you are stretching the capabilities of a formula made through Excel-2019 to it's very limits. Here is what I came up with considering the limitations of xpath 1.0 (no matches() nor lists/array handling:

enter image description here

Formula in B1:

=TEXTJOIN(" ",,IF(ROW(A$1:INDEX(A:A,LEN(A2)))=TRANSPOSE(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,IFERROR(ROW(A$1:INDEX(A:A,COUNTA(FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s"))))&"<s>"&FILTERXML("<t><s>"&SUBSTITUTE(LOWER(A2)," ","</s><s>")&"</s></t>","//s"),""))&"</s></t></x>","//t[.//*[not(contains(., '"&TEXTJOIN("') or contains (., '",,{"auti","tion","mpl","etc"})&"'))]]")),FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s"),""))

I have tested this against an array of 40 items of say 3-4 characters long and it did not exceed the limit of both TEXTJOIN()'s parameters nor the character limit it can handle.

It's an array-entered formula (CSE).

To summarize what this does:

  • Split your input on space (keep case-specifications intact);
  • Create a new input string using TEXTJOIN() to use another level of xpath childs to give each word an index;
  • With FILTERXML() we return each index that did not contain any of the array-elements you specified;
  • Use these numbers to retrieve all nodes from the 1st step and concat those back together;
  • Note that this will impact punctuation.

Note that this will become significantly easier with ms365's lambda's helper function REDUCE():

=REDUCE(A1,{"auti","tion","mpl","etc"},LAMBDA(a,b,TEXTJOIN(" ",,FILTERXML("<t><s>"&SUBSTITUTE(a," ","</s><s>")&"</s></t>","//s[not(contains(translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'), '"&b&"'))]"))))
JvdV
  • 70,606
  • 8
  • 39
  • 70
  • 1
    I dont know how you do this. But every time your answer is working, no meter what level of complexity it is. Thank you for patience to build also 365 version of formula every time. Soon i will upgrade to excel 365, and i will use also that version. Big thanks! – Losai Mar 14 '22 at 17:43
  • Just a small note for beginners like me: Be careful with formula in row builder section here `...(ROW(A$1:INDEX(A:A,LEN(A1))...`. to exclude some confusion is better to build row index in another cell address, like this: `...(ROW(X$1:INDEX(X:X,LEN(A1))...` – Losai Mar 15 '22 at 14:00
  • Also I suggest to put entire formula in IFERROR() function to exclude wrong returns of blank or trimmed cells. – Losai Mar 15 '22 at 14:07