I have data in this form:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:svg="http://www.w3.org/2000/svg">
<m:math display="inline"><m:semantics><m:mrow><m:mrow><m:mi>s</m:mi><m:mo></m:mo><m:mfenced close=")" open="("><m:mrow><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>1</m:mn></m:mrow></m:mfenced></m:mrow><m:mo>=</m:mo><m:mrow><m:mi>S</m:mi><m:mo></m:mo><m:mfenced close=")" open="("><m:mrow><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>1</m:mn></m:mrow></m:mfenced></m:mrow><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:annotation-xml encoding="MathML-Content"><m:apply><m:ci></m:ci><m:apply><m:times></m:times><m:ci>s</m:ci><m:apply><m:interval closure="open"></m:interval><m:cn>1</m:cn><m:cn>1</m:cn></m:apply></m:apply><m:eq></m:eq><m:apply><m:times></m:times><m:ci>S</m:ci><m:apply><m:interval closure="open"></m:interval><m:cn>1</m:cn><m:cn>1</m:cn></m:apply></m:apply><m:eq></m:eq><m:cn>1</m:cn></m:apply></m:annotation-xml></m:semantics></m:math>
</html>
What I am trying to do is write a function that tokenizes this data into a list of tokens, where each token is either an XML tag or the text between an opening and a closing tag.
For example, I would expect the function to return <m:math display="inline">, <m:semantics>, and <m:mrow>, as well as the s between <m:mi>s</m:mi> (the m:mi tags themselves should also be treated as tokens). This function would be passed to sklearn's CountVectorizer as the tokenizer parameter.
The expected tokens for this input sequence would be of this form:
['<m:math display="inline">', '<m:semantics>', '<m:mrow>', '<m:mrow>', '<m:mi>', 's', '</m:mi>', ..., '</m:math>']
I was initially trying to use a regex, but it seems there must be a much simpler way to do this with a dedicated function, and I am kind of lost as to how that function should behave. Any help is appreciated!
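
For reference, here is a minimal regex-based sketch of the behavior described above. The names tokenize_xml and TOKEN_RE are hypothetical, and the sketch assumes that no attribute value contains a ">" character and that whitespace-only text between tags (such as newlines) should be dropped. It is not a full XML parser, so the XML declaration and the DOCTYPE line would also come out as tag tokens.

```python
import re

# A tag token runs from "<" to the next ">"; a text token is any run
# of characters that contains neither "<" nor ">".
# Assumption: attribute values never contain ">".
TOKEN_RE = re.compile(r"<[^>]+>|[^<>]+")


def tokenize_xml(document):
    """Split an XML string into tag tokens and the text between tags."""
    tokens = []
    for match in TOKEN_RE.finditer(document):
        token = match.group().strip()
        if token:  # skip whitespace-only text, e.g. newlines between tags
            tokens.append(token)
    return tokens


sample = (
    '<m:math display="inline"><m:semantics><m:mrow><m:mrow>'
    "<m:mi>s</m:mi><m:mo></m:mo></m:mrow></m:mrow></m:semantics></m:math>"
)
print(tokenize_xml(sample))
```

A function like this could then be passed as CountVectorizer(tokenizer=tokenize_xml, lowercase=False). Turning off lowercasing matters here because the default lowercase=True would merge tokens such as s and S.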