Extract text from <1> (HTML/XML-Like but with Number Tag)

Question

So I have a long string containing pointy brackets that I wish to extract text parts from.

string exampleString = "<1>text1</1><27>text27</27><3>text3</3>";

I want to be able to get this

1 = "text1"
27 = "text27"
3 = "text3"

How would I obtain this easily? I haven't been able to come up with a non-hacky way to do it.

Thanks.

Well, these look like XML tags; you could wrap that string in a container element and then load it into an `XElement` or `XDocument`: `XDocument doc = XDocument.Parse(string.Format("<root>{0}</root>", exampleString));` — gmiley, Dec 23 '15 at 16:35
An XML parser won't work without some pre-processing, as XML element names must start with a letter or underscore. But you could do a search and replace to prepend a letter to the names, then wrap the whole thing in a container element as @gmiley suggests and parse it as XML. — adv12, Dec 23 '15 at 16:39
I would write a LINQ query over the string and parse the characters to determine the parts of each "xml" element, like this: XMLTagStart, XMLTagName, XMLTagStartEnd, XMLContent, XMLTagEndStart, XMLTagName2, XMLTagEndEnd. So create a class that has those names, parse the string, and fill a collection of that class. From there you can do this: `var myContent = MyXMLCollection.Select(p => p.XMLContent).ToList();` — JWP, Dec 23 '15 at 16:43
Yep, wasn't even thinking about the number part of it. As you said though, the numbers could be replaced, and that could be done with a regular expression match/replace. In fact, both the number and the value could be extracted using regular expressions. — gmiley, Dec 23 '15 at 16:43
You could start by splitting the string on `"><"`, or is that too hacky? — Graham, Dec 23 '15 at 16:46
Assuming the data format is as he shows, we can do it by "string.Replace" — Ian, Dec 23 '15 at 17:06
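The suggestions in the comments above (prepend a letter to each tag name, wrap everything in a container element, then parse as XML) can be combined into a short LINQ to XML sketch. The class name `NumberTagXml`, the method name `Parse`, the `n` prefix, and the `root` wrapper are all assumptions made for illustration; none of them come from the thread. It also assumes the text between tags contains no `<` or `&`, since either would make the XML parser throw.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class NumberTagXml
{
    // Hypothetical helper: turns "<1>text1</1>" into "<n1>text1</n1>" so the
    // tag names are legal XML names, wraps the result in a single root
    // element, and returns each child as a (number, text) pair.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        // Only tags that consist purely of digits are renamed.
        string renamed = Regex.Replace(input, @"<(/?)(\d+)>", "<${1}n${2}>");
        XElement root = XElement.Parse("<root>" + renamed + "</root>");

        return root.Elements()
            .Select(e => new KeyValuePair<string, string>(
                e.Name.LocalName.Substring(1), // drop the "n" prefix again
                e.Value))
            .ToList();
    }
}
```

Calling `NumberTagXml.Parse(exampleString)` on the string from the question should give three pairs, in document order, with keys `1`, `27`, and `3`.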

Ian · Accepted Answer · 2015-12-30T02:38:28.087

Using a basic XmlReader, plus a small wrapping trick to turn the string into valid XML, I would do something like this:

using System.Collections.Generic;
using System.IO;
using System.Xml;

string xmlString = "<1>text1</1><27>text27</27><3>text3</3>";
// Prefix every tag name with "o" and wrap everything in a single root element
xmlString = "<Root>" + xmlString.Replace("<", "<o").Replace("<o/", "</o") + "</Root>";
string key = "";
// Assuming the result should be a list of key/value pairs
List<KeyValuePair<string,string>> kvpList = new List<KeyValuePair<string,string>>();
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString))) {
    bool firstElement = true;
    while (xmlReader.Read()) {
        if (firstElement) { // throw away the root element
            firstElement = false;
            continue;
        }
        if (xmlReader.NodeType == XmlNodeType.Element) {
            key = xmlReader.Name.Substring(1); // cut off the "o" prefix
        } else if (xmlReader.NodeType == XmlNodeType.Text) {
            kvpList.Add(new KeyValuePair<string,string>(key, xmlReader.Value));
        }
    }
}

Edit:

The main trick is this line:

xmlString = "<Root>" + xmlString.Replace("<", "<o").Replace("<o/", "</o") + "</Root>"; // wrap in a single root; the "o" prefix forces every tag name to start with a known letter (comment edit suggested by Mr. chwarr)

Here you first replace every opening pointy bracket with the bracket plus a letter, i.e.

<1>text1</1> -> <o1>text1<o/1> //first replacement, fix the number issue

and then swap every sequence of opening pointy bracket + letter + forward slash into opening pointy bracket + forward slash + letter:

<o1>text1<o/1> -> <o1>text1</o1> //second replacement, fix the ending tag issue

Using a simple WinForms app with a RichTextBox to print out the result,

for (int i = 0; i < kvpList.Count; ++i) {
    richTextBox1.AppendText(kvpList[i].Key + " = " + kvpList[i].Value + "\n");
}

Here is the result I get:

1 = text1
27 = text27
3 = text3

First off, I didn't mean to get the exact output like that; I probably worded it badly. It's more that when I enter "1" it should output "text1", etc. Which leads me to the next issue: I don't see how I would be able to do that using your method. Regardless, I appreciate your effort. — DiscoPogo, Dec 23 '15 at 17:17
In that case, maybe you want to replace my `List<KeyValuePair<string,string>>` with a `Dictionary<string,string>`, so that you can call `dict["1"]` to get `"text1"`. You may want to try it out! It is quite fun! =) It is better to read such a file once and use the result many times than to read it again every time you need it. — Ian, Dec 23 '15 at 17:21
Works perfectly, thank you very much. My other solution was using Regex.Match, but that was a super hacky way compared to this. — DiscoPogo, Dec 23 '15 at 17:26
Glad to know that. =) Regex normally shows its strength when the text is a lot larger and the pattern is more difficult. Maybe this pattern is just not large and complicated enough for Regex to show its strength... — Ian, Dec 23 '15 at 17:27
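The Dictionary variant suggested in the comments could look roughly like the sketch below. It reuses the same Replace trick and XmlReader loop from the answer; the class name `NumberTagLookup` and the method name `Build` are hypothetical, and using `Depth` to skip the root (instead of the `firstElement` flag) is a swapped-in detail, not part of the original answer. If the same number appears twice, the later value silently overwrites the earlier one.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NumberTagLookup
{
    // Hypothetical helper: reads the string once and returns a lookup table
    // from tag number (as a string) to the enclosed text.
    public static Dictionary<string, string> Build(string input)
    {
        // Same trick as in the answer: prefix tag names with "o", add a root.
        string xml = "<Root>" + input.Replace("<", "<o").Replace("<o/", "</o") + "</Root>";

        var dict = new Dictionary<string, string>();
        string key = null;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // Depth 0 is <Root>; the numbered elements sit at depth 1.
                if (reader.NodeType == XmlNodeType.Element && reader.Depth == 1)
                {
                    key = reader.Name.Substring(1); // cut off the "o" prefix
                }
                else if (reader.NodeType == XmlNodeType.Text && key != null)
                {
                    dict[key] = reader.Value; // a repeated key overwrites
                }
            }
        }
        return dict;
    }
}
```

With this in place, `NumberTagLookup.Build(xmlString)["27"]` should return `"text27"` for the example string; an unknown key throws `KeyNotFoundException`, so `TryGetValue` may be the safer call for user input.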

score 1 · Answer 2 · answered Dec 23 '15 at 17:33

This is far from bulletproof, but you could use a combination of split and Regex matching:

string exampleString = "<1>text1</1><27>text27</27><3>text3</3>";

string[] results = exampleString.Split(new string[] { "><" }, StringSplitOptions.None);

Regex r = new Regex(@"^<?(\d+)>([^<]+)<");

foreach (string result in results)
{
    Match m = r.Match(result);
    if (m.Success)
    {
        string index = m.Groups[1].Value; // tag number, e.g. "27"
        string value = m.Groups[2].Value; // enclosed text, e.g. "text27"
        Console.WriteLine(index + " = " + value);
    }
}

The most obvious failure case I can think of is text that itself contains a "<"; that would pretty much break this.

@Fredou -- understood and agreed; that's actually one of my all-time favorite posts. When I say "not bulletproof," I meant it. — Hambone, Dec 23 '15 at 20:07
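As a comment under the question notes, both the number and the value can be pulled out with a regular expression alone. One way to soften the "<" weakness mentioned above is to match each closing tag against its opening number with a backreference, so a stray "<" inside the text no longer ends the match. This is only a sketch under stated assumptions: the names `NumberTagRegex` and `Extract` are made up for illustration, and it still fails if an element contains a nested tag with the same number.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberTagRegex
{
    // <(\d+)>  opening tag, captures the number
    // (.*?)    the text, matched lazily
    // </\1>    closing tag that must repeat the captured number
    private static readonly Regex Pair =
        new Regex(@"<(\d+)>(.*?)</\1>", RegexOptions.Singleline);

    // Hypothetical helper: returns (number, text) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

For an input such as `<1>a < b</1><27>text27</27>`, this should yield the pairs `1` / `a < b` and `27` / `text27`, whereas the split-based version above would skip the first element.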
