How to write a custom tokenizer for sklearn's CountVectorizer to treat all XML tags and all text between open and closed tags as tokens

Question

I have data in this form:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:svg="http://www.w3.org/2000/svg">
<m:math display="inline"><m:semantics><m:mrow><m:mrow><m:mi>s</m:mi><m:mo>⁢</m:mo><m:mfenced close=")" open="("><m:mrow><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>1</m:mn></m:mrow></m:mfenced></m:mrow><m:mo>=</m:mo><m:mrow><m:mi>S</m:mi><m:mo>⁢</m:mo><m:mfenced close=")" open="("><m:mrow><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>1</m:mn></m:mrow></m:mfenced></m:mrow><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:annotation-xml encoding="MathML-Content"><m:apply><m:ci></m:ci><m:apply><m:times></m:times><m:ci>s</m:ci><m:apply><m:interval closure="open"></m:interval><m:cn>1</m:cn><m:cn>1</m:cn></m:apply></m:apply><m:eq></m:eq><m:apply><m:times></m:times><m:ci>S</m:ci><m:apply><m:interval closure="open"></m:interval><m:cn>1</m:cn><m:cn>1</m:cn></m:apply></m:apply><m:eq></m:eq><m:cn>1</m:cn></m:apply></m:annotation-xml></m:semantics></m:math>
</html>

What I am trying to do is write a function that tokenizes this data into a list of tokens, where each token is either an XML tag or the text between an opening and a closing tag.

For example, I would expect the function to return <m:math display="inline">, <m:semantics>, and <m:mrow> as tokens, as well as the s in <m:mi>s</m:mi> (the m:mi tags themselves should also be tokens). The function would then be passed to sklearn's CountVectorizer as its tokenizer parameter.

The expected tokens for this input sequence would be of this form: ["<m:math display="inline">", "<m:semantics>", "<m:mrow>", "<m:mrow>","<m:mi>", "s", "</m:mi>", ..., "</m:math>"]

I was initially trying to use a regex, but it seems there must be a simpler way to do this with a function. I am just not sure how that function should behave. Any help is appreciated!

I'd [use an XML parser](https://stackoverflow.com/q/1732348/6243352). — ggorlen, Jul 19 '21 at 15:38
I would use an XML parser but I need to do it in regex because I need to pass the pattern as a parameter to a scikit-learn method — Zein, Jul 19 '21 at 15:49
Seems like an [XY problem](https://meta.stackexchange.com/a/233676/399876). If you expand on what you're really trying to achieve as an [edit](https://stackoverflow.com/posts/68443296/edit) to your post, someone can probably show you a better way, because what you're embarking on (parsing recursive structures with regex) is very likely a world of pain. — ggorlen, Jul 19 '21 at 15:51
Ok I have edited the post to clarify exactly what I am trying to do — Zein, Jul 19 '21 at 16:14
Thanks. Can you show the exact tokens you expect as output? I'm not an sklearn person, but the tokenizer is a function, so that means you're not limited to regex and you can use an xml parser. BTW, it's good to leave your regex attempt in there, so I'd re-add that to avoid your question being closed as "too broad", but provide the context as well so others can show a better way. — ggorlen, Jul 19 '21 at 16:19
For the input in the post I would expect all the individual tags as tokens as well as the text between an opening and closing tag, so something like: ```["<m:math display="inline">", "<m:semantics>", "<m:mrow>", "<m:mrow>","<m:mi>", "s", "</m:mi>", ..., "</m:math>"]``` — Zein, Jul 19 '21 at 16:21
I'd make that an edit to the post. Also, if this `m:` format is a well-known standard, it's good to mention that as well. If it is, 99% of the time there's already a package that does what you need. — ggorlen, Jul 19 '21 at 16:22
I believe the m: format is from MathML but I am not 100% certain — Zein, Jul 19 '21 at 16:27

score 0 · Answer 1 · answered Jul 24 '21 at 18:15

I've managed to solve the problem without using regex or an XML parser, by making use of Python's str.split() method. I split the XML input string on the closing bracket ">" and add the bracket back to each piece. Then I scan the resulting list: every piece that doesn't start with an opening bracket "<" holds text followed by a tag, so I split it on "<" (adding the bracket back to the tag) and insert the text and the tag into the main list of tokens at the correct index.
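A minimal sketch of that split-based approach might look like the following. The function name `tokenize_xml` is hypothetical, and two details are assumptions rather than part of the description above: whitespace-only text between tags is dropped, and the input is assumed to contain no ">" inside attribute values, comments, or CDATA sections. It also builds the token list in a single pass instead of inserting into an existing list.

```python
def tokenize_xml(text):
    """Split an XML string into tag tokens and text tokens (hypothetical helper)."""
    tokens = []
    # str.split(">") drops the bracket, so it is re-attached to each tag below.
    for chunk in text.split(">"):
        if "<" not in chunk:
            # No tag here: this is trailing text after the last tag, if any.
            if chunk.strip():
                tokens.append(chunk.strip())
            continue
        # Everything before "<" is text between tags; the rest is the tag.
        word, _, tag = chunk.partition("<")
        if word.strip():  # assumption: skip whitespace-only text
            tokens.append(word.strip())
        tokens.append("<" + tag + ">")
    return tokens


sample = '<m:mi>s</m:mi><m:mo>=</m:mo>'
print(tokenize_xml(sample))
# → ['<m:mi>', 's', '</m:mi>', '<m:mo>', '=', '</m:mo>']
```

To use it with scikit-learn, the function would be passed as `CountVectorizer(tokenizer=tokenize_xml, lowercase=False)`. Setting `lowercase=False` is my own assumption here, so that text such as `s` and `S` stays distinct.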
