I have a long string containing tags in pointy brackets, and I want to extract the text parts between them.

string exampleString = "<1>text1</1><27>text27</27><3>text3</3>";

I want to be able to get this

1 = "text1"
27 = "text27"
3 = "text3"

How would I obtain this easily? I haven't been able to come up with a non-hacky way to do it.

Thanks.

Ian
DiscoPogo
  • Use HTML/XML Parser. – Fᴀʀʜᴀɴ Aɴᴀᴍ Dec 23 '15 at 16:35
  • Well, these look like XML tags, you could wrap that string into a container element then load it into an `XElement` or `XDocument`. `XDocument doc = XDocument.Parse(string.Format("{0}", exampleString ));` – gmiley Dec 23 '15 at 16:35
  • An XML parser won't work without some pre-processing, as XML element names must start with a letter or underscore. But you could do a search and replace to prepend a letter to the names, then wrap the whole thing in a container element as @gmiley suggests and parse it as XML. – adv12 Dec 23 '15 at 16:39
  • I would write a LINQ query over the string and parse the characters to determine the parts of each "xml" element, like this: XMLTagStart, XMLTagName, XMLTagStartEnd, XMLContent, XMLTagEndStart, XMLTagName2, XMLTagEndEnd. So create a class that has those names, parse the string and fill in a collection of that class. From there you can do this: var myContent = MyXMLCollection.Select(p => p.XMLContent).ToList(); – JWP Dec 23 '15 at 16:43
  • Yep, wasn't even thinking about the number part of it. As you said though, the numbers could be replaced, and that could be done with a regular expression match/replace. In fact, both the number and value could be extracted using regular expressions. – gmiley Dec 23 '15 at 16:43
  • You could start by splitting the string on `""`, or is that too hacky? – Graham Dec 23 '15 at 16:46
  • Assuming the data format is as he shows, we can do it by "string.Replace" – Ian Dec 23 '15 at 17:06

2 Answers

Using a basic XmlReader, plus a small wrapping trick to turn the input into XML-like data, I would do something like this:

using System.Collections.Generic;
using System.IO;
using System.Xml;

string xmlString = "<1>text1</1><27>text27</27><3>text3</3>";
//prefix every tag name with "o" and wrap everything in a single root element
xmlString = "<Root>" + xmlString.Replace("<", "<o").Replace("<o/", "</o") + "</Root>";
string key = "";
List<KeyValuePair<string,string>> kvpList = new List<KeyValuePair<string,string>>(); //assuming the result is in the KVP format
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString))){
    bool firstElement = true;
    while (xmlReader.Read()) {
        if (firstElement) { //throwing away root
            firstElement = false;
            continue;
        }
        if (xmlReader.NodeType == XmlNodeType.Element) {
            key = xmlReader.Name.Substring(1); //cut off the leading "o"
        } else if (xmlReader.NodeType == XmlNodeType.Text) {
            kvpList.Add(new KeyValuePair<string,string>(key, xmlReader.Value)); //pair the text with the last tag seen
        }
    }
}

Edit:

The main trick is this line:

xmlString = "<Root>" + xmlString.Replace("<", "<o").Replace("<o/", "</o") + "</Root>"; //wrap everything in a single root; the "o" forces every tag name to start with a known letter (comment edit suggested by Mr. chwarr)

Here you first replace every opening pointy bracket with itself + a char, i.e.

<1>text1</1> -> <o1>text1<o/1> //first replacement, fix the number issue 

and then swap every sequence of opening pointy bracket + char + forward slash into opening pointy bracket + forward slash + char

<o1>text1<o/1> -> <o1>text1</o1> //second replacement, fix the ending tag issue
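The two replacements can be tried in isolation. This is only a sketch: `TagFixer` and `ToWellFormedXml` are hypothetical names that are not part of the answer, and it assumes the text between tags never contains "<" or "&", since either would still produce invalid XML.

```csharp
public static class TagFixer
{
    // Hypothetical helper: applies the two Replace calls from the answer,
    // then wraps the result in a single <Root> element.
    public static string ToWellFormedXml(string raw)
    {
        // Step 1: <1>text1</1> becomes <o1>text1<o/1>
        string afterFirst = raw.Replace("<", "<o");
        // Step 2: <o1>text1<o/1> becomes <o1>text1</o1>
        string afterSecond = afterFirst.Replace("<o/", "</o");
        return "<Root>" + afterSecond + "</Root>";
    }
}
```

Calling `TagFixer.ToWellFormedXml("<1>text1</1>")` should give `<Root><o1>text1</o1></Root>`, which an XML parser accepts because every element name now starts with a letter.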

Using a simple WinForms app with a RichTextBox to print out the result,

for (int i = 0; i < kvpList.Count; ++i) {
    richTextBox1.AppendText(kvpList[i].Key + " = " + kvpList[i].Value + "\n");
}

Here is the result I get:

[screenshot of the RichTextBox output]
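If the goal is to look a value up by its number rather than print the list, the same reader loop can fill a dictionary instead. A minimal sketch, assuming each number appears only once (a repeated number simply overwrites the earlier value here); `TagParser` and `Parse` are made-up names, and it checks `Depth` instead of using the `firstElement` flag to skip the root.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TagParser
{
    // Hypothetical helper: returns tag number -> text, e.g. "27" -> "text27".
    public static Dictionary<string, string> Parse(string raw)
    {
        // Same wrapping trick as above: prefix tag names with "o", add a root.
        string xml = "<Root>" + raw.Replace("<", "<o").Replace("<o/", "</o") + "</Root>";
        var result = new Dictionary<string, string>();
        string key = null;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // The root element sits at depth 0, the numbered tags at depth 1.
                if (reader.NodeType == XmlNodeType.Element && reader.Depth == 1)
                {
                    key = reader.Name.Substring(1); // drop the leading "o"
                }
                else if (reader.NodeType == XmlNodeType.Text && key != null)
                {
                    result[key] = reader.Value; // last value wins on a repeated key
                }
            }
        }
        return result;
    }
}
```

With the example string, `TagParser.Parse(exampleString)["27"]` should return `"text27"`. Parsing once and keeping the dictionary around avoids re-reading the string for every lookup.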

Ian
  • First off, I didn't mean to get the exact output like that, I probably worded it badly, but more like when I enter "1" it would output "text1" etc. Which leads me to the next issue, I don't see how I would be able to do that using your method. Regardless, I appreciate your effort. – DiscoPogo Dec 23 '15 at 17:17
  • In that case, maybe you want to replace my `List<KeyValuePair<string,string>>` with a `Dictionary<string,string>`, so that you can call `dict["1"]` to get `"text1"`. You may want to try it out! It is quite fun! =) It is better to read such a file once and use it many times than to re-read it every time you need it. – Ian Dec 23 '15 at 17:21
  • Works perfectly, thank you very much. My other solution was using Regex.Match, but that was a super hacky way compared to this. – DiscoPogo Dec 23 '15 at 17:26
  • Glad to know that. =) Regex normally shows its strength when the text is a lot larger and the pattern is more difficult. Maybe this pattern is just not large and complicated enough for Regex to show its strength... – Ian Dec 23 '15 at 17:27

This is far from bulletproof, but you could use a combination of split and Regex matching:

string exampleString = "<1>text1</1><27>text27</27><3>text3</3>";

string[] results = exampleString.Split(new string[] { "><" }, StringSplitOptions.None);

Regex r = new Regex(@"^<?(\d+)>([^<]+)<");

foreach (string result in results)
{
    Match m = r.Match(result);
    if (m.Success)
    {
        string index = m.Groups[1].Value; // the tag number, e.g. "27"
        string value = m.Groups[2].Value; // the text inside the tag, e.g. "text27"
        // use index and value here
    }
}

The most obvious failure case I can think of is text that contains a "<"; that would pretty much break this.
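One possible way around the "<" problem, swapping the split for a single pattern with a backreference, is sketched below. It is still not bulletproof: it assumes tags with the same number are never nested, and `TagRegex` and `Parse` are hypothetical names, not something from this answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagRegex
{
    // <(\d+)>  opening tag, capturing the number
    // (.*?)    the text, matched lazily
    // </\1>    closing tag that must carry the same number as the opening tag
    private static readonly Regex Pattern =
        new Regex(@"<(\d+)>(.*?)</\1>", RegexOptions.Singleline);

    // Hypothetical helper: returns tag number -> text.
    public static Dictionary<string, string> Parse(string raw)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pattern.Matches(raw))
        {
            result[m.Groups[1].Value] = m.Groups[2].Value; // last value wins on a repeated number
        }
        return result;
    }
}
```

Because the closing tag has to repeat the captured number, a stray "<" inside the text (as in `<1>a < b</1>`) no longer ends the match early.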

Hambone
  • @Fredou -- understood and agreed; that's actually one of my all-time favorite posts. When I say "not bulletproof," I meant it. – Hambone Dec 23 '15 at 20:07