How do I find an HTML tag in a non-HTML file?

Question

I can't parse it because it is not an HTML file. It is plain text that sometimes contains valid HTML tag openings, such as:

<a href="..." >

but also:

<anytag par1="val1" par2='val2' par3=val3 />

Everything would be nice and easy if not for this possibility:

<anytag param='square < brackets > in value' par2="and < another < such case" >

How can I match this with a regex?

(This is not valid HTML. The tags may appear in an ordinary text file and are loose: they are not contained in any proper structure and are not always closed. But the tag openings (headers) themselves are of course always terminated with >, as in the examples. I'm not interested in what comes after the tag, only in the opening header.)

You should take a look at http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not and reconsider parsing HTML with regex — Jan Dragsbaek, Nov 18 '11 at 12:40
I don't quite get what you want the regex to match - 'tagname', in your example? — canavanin, Nov 18 '11 at 12:41
If this is valid html shouldn't the `<` and `>` be `&lt;` and `&gt;`? — fredley, Nov 18 '11 at 12:44
This is not valid HTML, and the tags are not necessarily valid either. This is why I cannot use a parser. But browsers don't complain if they get unescaped angle brackets in attribute values. — rsk82, Nov 18 '11 at 12:45
If it's not valid html, this is basically impossible as 'non-valid html' is an unknown language. Whatever regex you come up with it's going to be trivial to come up with a counter-example since 'non-valid html' could be anything at all. E.g. `<>> />> />/>/>>>>>>/>` — fredley, Nov 18 '11 at 12:49
I agree with fredley. Unless you can define what it's going to look like in some terms you can't use a regular expression to match it. The best you can do is assume it is valid HTML and therefore try to use something like HTML agility pack to parse it and do its best to make sense of what you are given. — Chris, Nov 18 '11 at 12:54
It is not "that" invalid. When I said it is invalid, I meant that the file is just text in which semi-valid HTML tags appear here and there. Of course, if there is no closing bracket it is impossible to say where a tag ends, but angle brackets inside quoted values follow the usual string convention of many programming languages, and browsers accept this. I have a hunch that you are playing on my words. Look at the question: these are the cases and nothing more. — rsk82, Nov 18 '11 at 13:16
[close] this question will likely solicit opinion, debate, arguments, polling, or extended discussion. — Michael Durrant, Nov 18 '11 at 13:24
@MichaelDurrant I agree. Also `not "that" invalid` is a hilarious concept. — fredley, Nov 18 '11 at 14:42
Ok, I've done it myself, thank you for the insightful comments. `<\s*[a-zA-Z0-9]+(?:\s+[a-zA-Z0-9]+\s*=\s*(?:\'[^\']+\'|\"[^\"]+\"|[a-zA-Z0-9_.]+)+)*\s*/?>` — rsk82, Nov 18 '11 at 15:02
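
For reference, the pattern from the last comment can be checked against the three examples in the question. This is a minimal Python sketch, not the asker's own code: the pattern is copied verbatim, while `find_tag_openings` and `SAMPLE` are names made up here for illustration.

```python
import re

# The asker's pattern from the comment above, copied verbatim into a raw string.
TAG_RE = re.compile(
    r"""<\s*[a-zA-Z0-9]+(?:\s+[a-zA-Z0-9]+\s*=\s*(?:\'[^\']+\'|\"[^\"]+\"|[a-zA-Z0-9_.]+)+)*\s*/?>"""
)


def find_tag_openings(text):
    """Return every substring of `text` that the pattern accepts as a tag opening."""
    return [m.group(0) for m in TAG_RE.finditer(text)]


# Hypothetical sample text built around the three examples from the question.
SAMPLE = (
    'plain text <a href="..." > more text '
    "<anytag par1=\"val1\" par2='val2' par3=val3 /> and 3 < 4 here "
    "<anytag param='square < brackets > in value' "
    'par2="and < another < such case" > the end'
)

if __name__ == "__main__":
    for tag in find_tag_openings(SAMPLE):
        print(tag)
```

The quoted alternatives consume `<` and `>` inside attribute values, which is what makes the third example work. Because the pattern requires at least one character inside the quotes and only letters and digits in attribute names, an opening such as `<img alt="">` or one with a hyphenated attribute name is not matched.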

Friend of Kim · Accepted Answer · 2011-11-19T18:28:14.173

Try something like this:

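$regEx = "/(<[a-z A-Z]+(=\"[a-z A-Z]+\")*)+>/";

The group matches a < followed by one or more letters or spaces, then zero or more ="..." parts whose value consists of letters and spaces. The group may repeat one or more times, and the match ends with a single >. Single-quoted or unquoted values, and values that contain < or >, would need further alternatives.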
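
To see what a pattern of this shape accepts, here is a small Python sketch. It assumes the intended character range is `A-Z` and that the quoted value may hold one or more characters; both are my reading of the answer, and `matches_simple_tag` is a hypothetical helper name.

```python
import re

# Assumed reading of the answer's pattern: range A-Z, and one or more
# letters or spaces inside the double quotes.
SIMPLE_TAG_RE = re.compile(r'(<[a-z A-Z]+(="[a-z A-Z]+")*)+>')


def matches_simple_tag(text):
    """Return True if the simple pattern finds a tag opening anywhere in `text`."""
    return SIMPLE_TAG_RE.search(text) is not None


if __name__ == "__main__":
    for candidate in ['<b>', '<a href="abc">', '<a href="..." >', "<a par='x'>"]:
        print(candidate, matches_simple_tag(candidate))
```

Only the first two candidates match. Values containing anything other than letters and spaces, single-quoted values, and whitespace before the closing > all fall outside this pattern.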

edited Nov 19 '11 at 18:28

answered Nov 19 '11 at 18:03


I also used to have to ask others for regex stuff, but then I set out to learn it properly. Then I discovered that it isn't as hard as it seems. It's really handy to know how it works. [] means "match any one of these characters". () means "treat everything in here as one unit". * means "0 or more of the preceding item", + means "1 or more", ? means "0 or 1". . means "any single character". Good luck :) – Friend of Kim Nov 19 '11 at 22:43
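
The constructs listed in that comment can each be tried on a tiny input. A minimal Python sketch follows; the patterns and sample strings are made up for illustration, and `run_examples` is a hypothetical helper.

```python
import re

# (pattern, text, meaning) triples; each text is tested as a full match.
EXAMPLES = [
    (r"[abc]", "b", "[] matches any one of the listed characters"),
    (r"(ab)+", "abab", "() groups 'ab' so that + repeats the whole group"),
    (r"ab*", "a", "* allows zero or more of the preceding item"),
    (r"ab+", "a", "+ needs at least one of the preceding item"),
    (r"colou?r", "color", "? makes the preceding item optional"),
    (r"a.c", "a-c", ". matches any single character except a newline"),
]


def run_examples():
    """Return {pattern: bool} telling whether each pattern fully matches its text."""
    return {pat: re.fullmatch(pat, text) is not None for pat, text, _ in EXAMPLES}


if __name__ == "__main__":
    for (pat, text, meaning), ok in zip(EXAMPLES, run_examples().values()):
        print(f"{pat!r} vs {text!r}: {ok}  ({meaning})")
```

Every pair matches except `ab+` against `a`, since + requires at least one `b`.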
