1

Using PHP, I will have an array of tag name => tag URL pairs. I need to scan a text input (fairly large, a blog post), find every tag name in the text, and replace it with a link to its URL. To complicate things, a tag name that sits inside <h1>, <h2>, <code>, or <pre> tags must not be replaced. To simplify, I could require that the tag name be inside a <p> tag for the replacement to take place.

I am not sure how to accomplish this. I assume I will need regex, but I am a bit lost at the moment. If anyone could help, I would greatly appreciate it.

For example, a PHP tag would be turned into <a href="link here">PHP</a>

JasonDavis
  • How did you come to the conclusion to use regex for this? Related to one of your previous questions: http://stackoverflow.com/questions/5628783/extract-data-from-a-google-chrome-bookmarks-export-with-php – mario Jan 17 '12 at 00:33

4 Answers

3

You can use an XML parser like:

$array_of_tags = (array) simplexml_load_string($html);

OR

$xml_object = simplexml_load_string($html);

The first approach will give you your tags in a searchable array. The second will give you a SimpleXMLElement object.

You can then use a simple foreach loop to iterate over the elements in your array, or reference the properties of your SimpleXMLElement object. Have a look at the simplexml_load_string tutorial by W3C; it's very straightforward.
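To make the parser route concrete, here is a minimal sketch of the replacement step. It is only an illustration under assumptions: the markup must be well-formed, the fragment is wrapped in a made-up `<root>` element so the parser sees a single root, and it uses `DOMDocument` with `DOMXPath` rather than SimpleXML because editing nodes is easier there. The function name `linkTagsInParagraphs` and the example URL are hypothetical.

```php
<?php
// Sketch: link tag names that appear in text directly inside <p> elements.
// Assumes well-formed (X)HTML; linkTagsInParagraphs is a hypothetical name.
function linkTagsInParagraphs(string $html, array $tags): string
{
    if ($tags === []) {
        return $html;
    }

    $doc = new DOMDocument();
    // The parser needs one root element, so wrap the fragment ("root" is arbitrary).
    $doc->loadXML('<root>' . $html . '</root>');

    // Build one alternation of all tag names, matched as whole words.
    $names = array_map(function ($name) {
        return preg_quote($name, '/');
    }, array_keys($tags));
    $pattern = '/\b(' . implode('|', $names) . ')\b/';

    $xpath = new DOMXPath($doc);
    // Copy the node list first, because the loop modifies the tree.
    $textNodes = iterator_to_array($xpath->query('//p/text()'));

    foreach ($textNodes as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold the matched tag names.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no tag name in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $tags[$part]);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize the children of the wrapper so <root> itself is dropped again.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

$tags = array('PHP' => 'http://example.com/tags/php'); // hypothetical URL
echo linkTagsInParagraphs('<h1>PHP tips</h1><p>I like PHP a lot.</p><pre>PHP code</pre>', $tags);
```

Only text that is a direct child of a `<p>` is touched here, so headings and `<pre>` blocks are left alone. Text nested deeper (for example inside an existing link within a paragraph) is skipped as well, which may or may not be what you want.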

travega
  • 1
    I think this is the way to go, maybe in combination with rdlowrey's answer – JasonDavis Jan 17 '12 at 00:43
  • 2
    +1 because DOM is **ALWAYS** the preferred method for parsing/editing (X)HTML markup. However, if someone is looking for a simple solution and isn't comfortable with even basic looping or regex this may be a big ask. Also, I would suggest DOM over SimpleXML if replacement is needed because SimpleXML is optimized for reading and iteration, not DOM manipulation. –  Jan 17 '12 at 00:46
  • 1
    @rdlowrey Agreed. I guess there is some learning to be done either way but I'd always advocate putting in a bit of extra work to learn the "best practice" approach to avoid "relearning" in the future ;) – travega Jan 17 '12 at 00:51
  • I was just testing this out. It seems that you must wrap everything in a single tag: all the paragraphs, headers, etc. have to be inside one parent tag, which can have any name. Do you know if there is a way around this? – JasonDavis Jan 17 '12 at 01:10
  • I guess I could temporarily add a wrapper so it works and then remove it – JasonDavis Jan 17 '12 at 01:10
  • @jasondavis I have clarified my post to give more details. – travega Jan 17 '12 at 01:11
  • @travega what I am saying is that the actual content passed to the `simplexml_load_string` must be wrapped in a parent node otherwise you will get an error like `Warning: simplexml_load_string(): Entity: line 6: parser error :Extra content at the end of the document in ` – JasonDavis Jan 17 '12 at 01:23
  • Well, yes, if you are using a DOM parser you need to have valid (X)HTML. If you are missing closing tags then you will have an issue, but if you are just missing wrapping tags then you can simply add them to either side of your input string. – travega Jan 17 '12 at 02:27
1

I wouldn't use regex (and I don't think you would be able to), but I think you just need to get down to brass tacks on this one. Loop over the text and keep booleans to track whether you are currently inside an <h1>, <h2>, <code>, or <pre>. If you are and you find something that would be replaced, leave it alone; otherwise replace it. Does that make sense? I can get more detailed if you want. But travega's answer is the best.
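A rough sketch of that loop might look like the following. It is an assumption-laden illustration: it uses `preg_split` only to cut the input into markup tokens and text tokens, keeps a depth counter instead of separate booleans, and assumes the excluded tags are properly closed and never written as self-closing. The function name `linkTagsOutsideExcluded` and the example URL are made up.

```php
<?php
// Sketch: walk the input token by token and skip text inside excluded tags.
// linkTagsOutsideExcluded is a hypothetical name, not an existing function.
function linkTagsOutsideExcluded(string $html, array $tags): string
{
    $excluded = array('h1', 'h2', 'code', 'pre');

    // Precompute the replacement link for every tag name.
    $links = array();
    foreach ($tags as $name => $url) {
        $links[$name] = '<a href="' . htmlspecialchars($url) . '">' . $name . '</a>';
    }

    // Split into markup and text; the captured markup tokens are kept in the result.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0; // how many excluded elements we are currently inside
    $out = '';
    foreach ($tokens as $token) {
        if (preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $token, $m)) {
            // Markup token: adjust the depth if it opens or closes an excluded tag.
            if (in_array(strtolower($m[2]), $excluded, true)) {
                $depth += ($m[1] === '/') ? -1 : 1;
            }
            $out .= $token;
        } elseif ($depth > 0) {
            $out .= $token; // text inside an excluded element stays untouched
        } else {
            $out .= strtr($token, $links); // plain substring replacement
        }
    }
    return $out;
}

$tags = array('PHP' => 'http://example.com/tags/php'); // hypothetical URL
echo linkTagsOutsideExcluded('<h1>PHP intro</h1><p>PHP is fun</p><code>PHP</code>', $tags);
// prints: <h1>PHP intro</h1><p><a href="http://example.com/tags/php">PHP</a> is fun</p><code>PHP</code>
```

Note that `strtr` replaces substrings, so a tag named PHP would also match inside a longer word such as PHPUnit; a word-boundary regex would be needed to avoid that.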

jakx
1

A simple loop will suffice here:

$post = 'My link to {tag1} is awesome, but not as awesome as my link to {tag2}';

$tags = array(
  'tag1' => 'http://tag1.com',
  'tag2' => 'http://tag2.com',
  'tag3' => 'http://tag3.com',
);

foreach ($tags as $tag_name => $tag_val) {
  $post = str_replace('{'.$tag_name.'}', "<a href='$tag_val'>$tag_name</a>", $post);
}

echo $post;
// outputs:
// My link to <a href='http://tag1.com'>tag1</a> is awesome, but not as awesome as my link to <a href='http://tag2.com'>tag2</a>
  • @kristian you're right, but it is kind of a hack because it looks for `{tag1}` instead of `tag`, so inside of those other tags you wouldn't use the `{tag1}` method – JasonDavis Jan 17 '12 at 00:40
  • 1
    @Kristian You are correct. I was simply trying to demonstrate that simple replacement can be managed without more serious solutions ... I would generally recommend **DOM** parsing, however. –  Jan 17 '12 at 00:47
1

I assume the excluded h1, h2, code and pre tags are not nested, and that you do the parsing when the post is inserted. In that case I would do:

  1. preg_replace_callback with <(h1|h2|code|pre)>(.*?)</\1>, replacing each match with a placeholder and storing it in an array as placeholder => html code
  2. strtr to replace tags
  3. strtr to replace placeholders with original code

It definitely isn't a brilliant solution, but if you do this only when inserting a post, it shouldn't be so bad.
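The three steps above could be sketched roughly like this. It is an illustration under assumptions: the excluded elements are not nested, the placeholders are NUL-delimited counters (so tag names must not consist only of digits or contain NUL bytes), and the pattern is slightly widened to allow attributes on the opening tag. The function name `linkTagsWithPlaceholders` and the example URL are hypothetical.

```php
<?php
// Sketch of the placeholder approach; linkTagsWithPlaceholders is a hypothetical name.
function linkTagsWithPlaceholders(string $html, array $tags): string
{
    // Step 1: swap every excluded element for a placeholder and remember the original.
    $stash = array();
    $masked = preg_replace_callback(
        '~<(h1|h2|code|pre)\b[^>]*>.*?</\1>~is',
        function (array $m) use (&$stash) {
            // Assumption: NUL bytes never occur in the post and no tag name is purely numeric.
            $key = "\x00" . count($stash) . "\x00";
            $stash[$key] = $m[0];
            return $key;
        },
        $html
    );

    // Step 2: replace tag names in what is left (plain substring replacement).
    $links = array();
    foreach ($tags as $name => $url) {
        $links[$name] = '<a href="' . htmlspecialchars($url) . '">' . $name . '</a>';
    }
    $linked = strtr($masked, $links);

    // Step 3: put the original excluded elements back.
    return strtr($linked, $stash);
}

$tags = array('PHP' => 'http://example.com/tags/php'); // hypothetical URL
echo linkTagsWithPlaceholders('<h1>PHP intro</h1><p>PHP is fun</p><pre>PHP</pre>', $tags);
// prints: <h1>PHP intro</h1><p><a href="http://example.com/tags/php">PHP</a> is fun</p><pre>PHP</pre>
```

Passing an array to `strtr` means already replaced text is not scanned again, so a link inserted in step 2 cannot be rewritten a second time.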

Kristian