Trouble parsing a string containing a file

Question

I have a file which I'm parsing myself. Every time I spot a "<" or ">" I split the string like so:

xml = file.split("[<>]");

This will give me the tag, the data and the closing tag.

Once this is done, I determine what type of tag it is and handle it accordingly. An item tag has a description, like so:

<description>
<![CDATA[
<img width="460" height="259" src="http://www.cbc.ca/gfx/images/news/topstories/2012/03/28/hi-parliament-stop-852-7931-6col.jpg"><br/><p>Finance Minister Jim Flaherty's budget will take the shine off what critics call MPs' gold-plated pensions, reports Greg Weston for CBC News.</p>
]]>
</description>

The problem here is that it splits on every "<" and ">", so the description content I'm looking for gets lost.

How can I handle the description (and possibly other tags I'm searching for) when its content contains multiple "<" and ">" characters that I don't care about, that is, ones that are not part of the opening or closing tag?

If this is actually XML, why aren't you using an XML parser? — Jon Skeet, Mar 29 '12 at 05:44
i want to learn how to do it myself... thinking through it and parsing it myself — BigBug, Mar 29 '12 at 05:45
@BlueMonster: Fundamentally, it's a bad idea to parse XML using regular expressions. I'm sure there are much more *productive* things you could learn about. — Jon Skeet, Mar 29 '12 at 05:46
thanks... but...i'm not asking you for advice on what i should and shouldn't spend my time on... i'm asking for thoughts on a particular Q and a problem i'm stuck at...more specifically thoughts towards a solution — BigBug, Mar 29 '12 at 05:47
Well, throwing a parser at it *is* a solution to your problem. And one to your more fundamental problem of even trying. — Joey, Mar 29 '12 at 05:49
There are reasons why people use tried-and-true parsers instead of coding their own. [<>] is a VERY VERY bad regex to use when parsing... just BAAAAD all around. If you want to teach yourself something new, then listen to these people. You are tackling a parser in a poor fashion and already reaping the (mis)benefits of doing so. Look up how parsers/interpreters work, how they read documents at a low level and process that data. — Mike McMahon, Mar 29 '12 at 06:00
@BlueMonster Let's say we created a regex parser which could parse the example you give. We can GUARANTEE that we could then give it some valid XML which would break it. Every time you mend the parser we can find XML which breaks it. It will never stop. Even if you think your XML is so simple that it will always be parsable you will get an unexpected instance that breaks you. The very fact that you want to use CDATA guarantees it — peter.murray.rust, Mar 29 '12 at 07:14
Lol i love how so many of you just keep repeating what was already said. thanks, for nothing! i GET the point. none the less, i'm GOING to create my own parser. If there are people out there who can do it, i want to be one of those people. Of course, i'm JUST starting out. and it might take me a LONG time, but i'm GOING to do it!!!!! — BigBug, Mar 29 '12 at 07:24

score 3 · Accepted Answer · answered Mar 29 '12 at 05:50

If you want to learn how to write a good XML parser, then why not look at some open source XML parsers? Read the source, Luke!

answered by jhsowter

score 2 · Answer 2 · answered Mar 29 '12 at 07:06

One key difference between a proper parser and a regular expression is that a parser uses a stack so it can keep track of nested structures. Just splitting up on angle brackets gives you a flat list of strings with no indication of what elements are nested within what others; that's why it can't find the end tag that matches a given start tag.

Think about what happens if the XML file contains this:

<foo>
  <foo>
  </foo>
</foo>

When you see <foo>, you can't just look for the next </foo> and assume everything in between is the body.

What you need to do is when you see a start tag, push it onto a stack of elements that you're currently "within". When you see an end tag, check that it matches the topmost start tag on the stack. If it does, pop that tag from the stack — you're no longer within that element. If it doesn't match, signal an error; the input had <foo></bar> or something similar.
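
The push/pop check described above can be sketched in Java roughly as follows. This is a minimal illustration, not a full parser: the class name `TagStackCheck` and the method `isWellNested` are made up for this example, and it assumes the input has already been cut into tokens of the form `<name>` and `</name>`, with no attributes or self-closing tags.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical helper for illustration only; not part of any library.
class TagStackCheck {

    /**
     * Returns true if every end tag matches the most recent unclosed start tag
     * and nothing is left open at the end.
     * Assumes tokens look like "<name>" or "</name>"; anything else is treated
     * as character data and skipped.
     */
    static boolean isWellNested(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>(); // elements we are currently "within"
        for (String token : tokens) {
            if (token.startsWith("</") && token.endsWith(">")) {
                String name = token.substring(2, token.length() - 1);
                // An end tag must match the topmost start tag on the stack.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false; // e.g. <foo></bar>
                }
            } else if (token.startsWith("<") && token.endsWith(">")) {
                // A start tag: remember that we are now inside this element.
                open.push(token.substring(1, token.length() - 1));
            }
            // Plain text between tags does not affect nesting.
        }
        return open.isEmpty(); // leftover entries mean unclosed elements
    }
}
```

With this sketch, the nested `<foo><foo></foo></foo>` input is accepted because each `</foo>` pops the innermost open `foo`, while `<foo></bar>` is rejected at the mismatching end tag.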

For HTML (as opposed to XML) it's more complex since some end tags are optional: <div><p></div> is not an error, for example. You could read the HTML spec and figure out all the rules and special cases, or you could just use one of the existing parser libraries that's available, and save yourself a lot of trouble.
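
For the CDATA block in the question specifically, a hand-written scanner can walk the string by position and treat everything between `<![CDATA[` and `]]>` as one opaque piece of text, instead of splitting on every angle bracket. The following is only a sketch under several assumptions: the names `CdataAwareScanner`, `Token`, and `Kind` are invented for this example, and it ignores comments, processing instructions, entities, and `>` characters inside attribute values, all of which real XML allows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for illustration; real XML has many more cases.
class CdataAwareScanner {
    enum Kind { TAG, TEXT, CDATA }

    static final class Token {
        final Kind kind;
        final String value;
        Token(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    /** Cuts the input into tags, CDATA payloads, and non-blank text runs. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.startsWith(CDATA_START, pos)) {
                // Everything up to the first "]]>" is literal text, brackets included.
                int end = input.indexOf(CDATA_END, pos + CDATA_START.length());
                if (end < 0) throw new IllegalArgumentException("unterminated CDATA section");
                tokens.add(new Token(Kind.CDATA, input.substring(pos + CDATA_START.length(), end)));
                pos = end + CDATA_END.length();
            } else if (input.charAt(pos) == '<') {
                // An ordinary tag runs to the next '>' (simplifying assumption).
                int end = input.indexOf('>', pos);
                if (end < 0) throw new IllegalArgumentException("unterminated tag");
                tokens.add(new Token(Kind.TAG, input.substring(pos, end + 1)));
                pos = end + 1;
            } else {
                // Character data runs to the next '<' or to the end of input.
                int next = input.indexOf('<', pos);
                if (next < 0) next = input.length();
                String text = input.substring(pos, next);
                if (!text.trim().isEmpty()) tokens.add(new Token(Kind.TEXT, text));
                pos = next;
            }
        }
        return tokens;
    }
}
```

Because the CDATA check runs before the generic `<` check, the `<img>`, `<br/>`, and `<p>` inside the description come back as a single `CDATA` token rather than as separate tags. Keeping the token kind alongside the value matters here, since the payload itself starts with `<` and would otherwise be mistaken for a tag by later stages.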

Thanks for the advice Wyzard, that's actually helpful, unlike a lot of the other comments i've been receiving... — BigBug, Mar 29 '12 at 07:25
And this answer wouldn't be complete without a link to the [most upvoted answer on all of StackOverflow](http://stackoverflow.com/a/1732454/226975). :-) — Wyzard, Mar 29 '12 at 07:34
You might be interested in using SAX or StAX, by the way, or at least studying their APIs for educational value. They take care of the lower-level parsing, so you can think of the file as a sequence of start tags and end tags rather than a sequence of characters, but they leave all the meaningful interpretation (and stack-related stuff) to the application. — Wyzard, Mar 29 '12 at 07:47
Lol thanks, i actually like one of the answers there, which talks about a limited "known" set of html tags - which is exactly what i'm doing... it's not a generic parser.. just one i'm making to capture specific data i'm looking for. Thanks again :) — BigBug, Mar 29 '12 at 07:48
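
To illustrate the StAX suggestion from the comments above, here is a rough sketch that pulls the text of the first description element out of a document using the JDK's `javax.xml.stream` API. The class name `DescriptionReader` and the method `firstDescription` are invented for this example, and the element name `description` is taken from the question; wrapping the checked `XMLStreamException` in an unchecked exception is a choice made only to keep the sketch short.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; only the javax.xml.stream calls are real API.
class DescriptionReader {

    /** Returns the text of the first <description> element, or null if there is none. */
    static String firstDescription(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The reader reports events (start tag, text, end tag, ...) one at a time.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "description".equals(reader.getLocalName())) {
                    // Collects the element's character data, CDATA sections included.
                    return reader.getElementText();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

The application still decides what each element means, but the markup inside the CDATA section arrives as ordinary text, so the angle brackets in the description never need special handling.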

score 1 · Answer 3 · answered Mar 29 '12 at 05:45

Trail: Java API for XML Processing. And please forget the »let's split a string at [<>]« idea as quickly as possible again.

answered by Joey

Despite your wanting, this answer remains the same. – Joey Mar 29 '12 at 05:47
