Extracting Strings -- if not Regex, then what?

Question

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have a file containing about 2000 lines such as this:

<nobr>&nbsp;&nbsp;&nbsp;&nbsp;<a href="../Carbon_Monoxide_Poisoning_Prevention.htm"><b>poisoning - prevention</b></a></nobr><br>
<nobr>&nbsp;&nbsp;&nbsp;&nbsp;<a href="../Carbon_Monoxide_Symptoms.htm"><b>symptoms</b></a></nobr><br>

1.) the URL is ALWAYS in the form of ../foo.html

2.) the display name is SOMETIMES enclosed with <b> ... </b> tags, and sometimes not.

3.) each line in the file contains up to four &nbsp; entities that I need to count and flag as spaces. These will EVENTUALLY be used to format indents, so I need to capture the information somehow.

I need to have the hyperlink, display name and number of spaces in a delimited flat file as follows (based on the above data):

../Carbon_Monoxide_Poisoning_Prevention.htm,poisoning - prevention,4
../Carbon_Monoxide_Symptoms.htm,symptoms,4

While I can parse this through a whole mess of String, substring, and if statements, that seems to be more cumbersome than it needs to be. I was investigating Regex (my first time doing so), but am a little unclear on some of the syntax; I learn best seeing a code sample similar to my applications, but have not been able to find examples of anything that quite fits.

Any help would be appreciated!

Regex can be used to parse a limited subset of HTML. In general it is a bad idea. – Lukasz Madon, Jun 14 '12 at 15:07
Is it important to capture the bold tags, or are you only mentioning the tags because they will affect the regex? (Do you only want the text, or the text with bold if bold is present?) – Platinum Azure, Jun 14 '12 at 15:08
to reiterate @lukas, [don't use regex to parse html](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Evan Davis, Jun 14 '12 at 15:09
@lukas: Assuming the order is preserved (`&nbsp;` before link before text) this should be pretty easy to parse with regex. I agree it's a bad idea though :-) – Platinum Azure, Jun 14 '12 at 15:09
@PlatinumAzure - text ONLY, bold not needed, but mentioned because I can't rely on the text being preceded by <b> – dwwilson66, Jun 14 '12 at 15:12
Ok, see @MK.'s answer then. Regex is only used to match patterns; it's not going to count spaces for you or return formatted data like that. – Evan Davis, Jun 14 '12 at 15:12
@lukas - likely why I've not been able to find good examples of code. :) – dwwilson66, Jun 14 '12 at 15:19
Thanks, @Downvoter. :| The Java Guides say "Matcher: An engine that performs match operations on a character sequence by interpreting a Pattern." and "Pattern: A compiled representation of a regular expression." It would appear to me that I'm trying to parse a line of text characters, based on patterns that appear in the text. Fine, I get it...it's possible, but caveat emptor. Is that a reason to downvote? – dwwilson66, Jun 14 '12 at 15:57

score 0 · Answer 1 · answered Jun 14 '12 at 15:09

If any counting of things is needed as an output, you should not (and probably cannot) use regular expressions. In general, if what you are trying to do is described by an algorithm, you should program it. If what you are trying to do is described as "I'm looking for a string/substring that looks like...", a regular expression might be a good idea.
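One way to read this split in Java, as a minimal sketch rather than a definitive approach: the pattern only locates each `&nbsp;` entity, and the counting is ordinary program code around it. The class name `NbspCounter` and method `countNbsp` are made up for illustration; `Pattern` and `Matcher` are the standard `java.util.regex` classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the name and shape are assumptions, not from the thread.
final class NbspCounter {
    // Matches the literal entity text "&nbsp;" (it contains no regex metacharacters).
    private static final Pattern NBSP = Pattern.compile("&nbsp;");

    // The regex finds occurrences; the loop is what actually counts them.
    static int countNbsp(String line) {
        Matcher m = NBSP.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "<nobr>&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"../Carbon_Monoxide_Symptoms.htm\">"
                + "<b>symptoms</b></a></nobr><br>";
        System.out.println(countNbsp(line)); // prints 4
    }
}
```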

MK.

It can sometimes be a good idea to use regex to tokenize and then do further post-processing on relevant parts of the data, however. – Platinum Azure Jun 14 '12 at 15:20
@PlatinumAzure it's a matter of personal preference. I personally think that regular expressions are good for quick one-off file transformations, for manual text editing (search and replace) and for exposing to the end user as a configuration parameter of your app. But they are usually not good in the middle of your high-level code (Java, C++, etc.). – MK. Jun 14 '12 at 15:22

score 0 · Answer 2 · answered Jun 14 '12 at 15:12

I wouldn't say regex, but you may be able to avoid writing a whole program by using a scripting language. There are some tools in Bash/Perl/Powershell/etc. that seem like they would work better for your purpose. Then you can still use tools like grep to leverage the power of regular expressions mixed with other tools, data structures, conditionals, etc. In addition, if you're going to be working through heavy HTML, there are tools out there you could pipe to or call to make your life easier.

score 0 · Answer 3 · edited May 23 '17 at 12:18

Parsing HTML with regular expressions is not appropriate because HTML isn't a regular language. How many times does this have to be asked? Besides, regular expressions aren't a programming language; you can't do the counting and bookkeeping you want to do with them. They are for matching patterns in a regular language.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

Regular expressions are specialized tools; they aren't hammers to beat in every nail that looks like a String that needs to be pattern matched or searched or otherwise manipulated.

Jeff Atwood has a good discussion of the pros and cons of regular expressions. If you don't know a lot about them, read what he has to say before you try to wield them.


answered Jun 14 '12 at 15:37

Well, apparently it needs to be asked one more time because I'm not familiar with Regex enough to know that I can't use that package to search for patterns of text IF that text happens to be HTML. But can't is pretty strong...the post to which you linked includes many answers that say yeah, you CAN do it, but caveat emptor. I'm just trying to understand a good way to accomplish what I need to--and based on my research, regex seems most appropriate. – dwwilson66 Jun 14 '12 at 15:56
You can until you can't, so why start down a path that ultimately leads to failure? You can dig in the dirt with a plastic spoon, until you need to dig a ditch. I posted the link to "regular languages" to explain what that means. HTML/XML isn't a regular language; you might be able to parse a few specific fragments, but ultimately you will encounter something that you can't do. There is no reason to start with a worse practice, especially as a beginner, and then have to undo what you "know" later. – Jun 14 '12 at 16:00

score 0 · Accepted Answer · answered Jun 14 '12 at 16:02

A single pattern can only grab one kind of thing at a time: all the URLs, the display names, or the spaces. I would not use regular expressions per se to do this, but here is how I would go about it if I absolutely had to use regex:

To grab the url in a line: \.\./.*\.html?

To grab the display name: (?<=("|b)>)[a-zA-Z].+?(?=(</(a|b)))

To grab the spaces (simply): &nbsp;

I would first split the file by the <br> tag to get the individual lines, and then run the regexes above to pull out the URL, display name, and spaces and combine them in a delimited output. I'm sure Java has a preg_match_all equivalent to match all patterns found (it would be useful for the spaces and counting them).

Note that these patterns were tested in Sublime Text and probably will not work in Java without a little tweaking. I can modify my answer later to include the Java if needed, but for a one-off thing like this you may be better off using Python or some other scripting language.
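The steps above might look roughly like this in Java. This is a sketch under assumptions, not the answerer's tested code: the class `LinkLineParser` and method `parse` are hypothetical names, the URL pattern uses `[^"]*` instead of `.*` so the match stays inside the href value, and the name pattern uses `.*?` instead of `.+?` so one-letter names also match. The `while (m.find())` loop plays the role of preg_match_all for counting the `&nbsp;` entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and structure are assumptions for illustration.
final class LinkLineParser {
    // URL: "../" up to ".htm" or ".html"; [^"]* keeps the match inside the quotes.
    private static final Pattern URL = Pattern.compile("\\.\\./[^\"]*\\.html?");
    // Display name: starts with a letter right after "> or b>, ends before </a or </b.
    private static final Pattern NAME =
            Pattern.compile("(?<=(\"|b)>)[a-zA-Z].*?(?=</(a|b))");
    // Literal entity text; each occurrence counts as one space.
    private static final Pattern NBSP = Pattern.compile("&nbsp;");

    // Splits on <br>, then emits "url,name,spaces" for every entry that has a link.
    static List<String> parse(String html) {
        List<String> rows = new ArrayList<>();
        for (String entry : html.split("<br>")) {
            Matcher url = URL.matcher(entry);
            Matcher name = NAME.matcher(entry);
            if (!url.find() || !name.find()) {
                continue; // blank or unexpected fragment: skip it
            }
            int spaces = 0;
            Matcher nbsp = NBSP.matcher(entry);
            while (nbsp.find()) {
                spaces++;
            }
            rows.add(url.group() + "," + name.group() + "," + spaces);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html =
            "<nobr>&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"../Carbon_Monoxide_Poisoning_Prevention.htm\">"
            + "<b>poisoning - prevention</b></a></nobr><br>\n"
            + "<nobr>&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"../Carbon_Monoxide_Symptoms.htm\">"
            + "<b>symptoms</b></a></nobr><br>\n";
        // Prints:
        // ../Carbon_Monoxide_Poisoning_Prevention.htm,poisoning - prevention,4
        // ../Carbon_Monoxide_Symptoms.htm,symptoms,4
        for (String row : parse(html)) {
            System.out.println(row);
        }
    }
}
```

A display name that itself contains a comma would need quoting or a different delimiter; the sketch does not handle that.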

Best of luck!

score -1 · Answer 5 · answered Jun 14 '12 at 15:19

Regex would be a correct way to approach this, along with a string tokenizer (for counting spaces). You will have to use substrings, though, as a way of moving through the original string.

Here are some links (that contain examples) on Patterns and Tokenizers
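As a rough illustration of the substring part of this suggestion, here is a sketch that walks one line with `indexOf` and `substring` only, with no regex at all. It assumes each line has exactly the shape shown in the question (one `href="..."`, then an optional `<b>`, then the text); the class `SubstringLineParser` and its methods are hypothetical names.

```java
// Hypothetical helper: assumes the exact line shape from the question.
final class SubstringLineParser {
    private static final String NBSP = "&nbsp;";
    private static final String HREF_KEY = "href=\"";
    private static final String BOLD_OPEN = "<b>";

    // Counts non-overlapping occurrences of token by stepping through the text.
    static int countOccurrences(String text, String token) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(token, from)) != -1) {
            count++;
            from += token.length();
        }
        return count;
    }

    // Returns "url,name,spaces", or null when the line has no recognizable link.
    static String toCsv(String line) {
        int hrefStart = line.indexOf(HREF_KEY);
        if (hrefStart < 0) {
            return null;
        }
        hrefStart += HREF_KEY.length();
        int hrefEnd = line.indexOf('"', hrefStart); // closing quote of the href value
        if (hrefEnd < 0) {
            return null;
        }
        int tagEnd = line.indexOf('>', hrefEnd); // end of the opening <a ...> tag
        if (tagEnd < 0) {
            return null;
        }
        int textStart = tagEnd + 1;
        if (line.startsWith(BOLD_OPEN, textStart)) {
            textStart += BOLD_OPEN.length(); // skip the optional <b>
        }
        int textEnd = line.indexOf('<', textStart); // next tag ends the display name
        if (textEnd < 0) {
            return null;
        }
        return line.substring(hrefStart, hrefEnd) + ","
                + line.substring(textStart, textEnd) + ","
                + countOccurrences(line, NBSP);
    }

    public static void main(String[] args) {
        String line = "<nobr>&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"../Carbon_Monoxide_Symptoms.htm\">"
                + "<b>symptoms</b></a></nobr><br>";
        System.out.println(toCsv(line)); // prints ../Carbon_Monoxide_Symptoms.htm,symptoms,4
    }
}
```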

ShWebb

After a little more thought, I think you could probably do this by just using a String tokenizer.... – ShWebb Jun 14 '12 at 16:33
