I use a function to get the first "x" words of a string. The main part is:

preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);

When a word is inside HTML, for example:

<a href="/"><u>Linktext</u></a>

The regex counts "Linktext" as a word. It should be changed to skip every word that is inside an HTML tag.

Is this possible?
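To make the behaviour concrete, here is a minimal sketch that runs the pattern on the sample markup and collects the non-empty text captured in group 3. The name `$textParts` is introduced only for this illustration. The point is that the pattern pairs each tag with the text that follows it and has no notion of nesting, so text inside `<a>` or `<u>` looks the same as any other text.

```php
<?php
// Pattern from the question: an optional tag (group 1), its tag name (group 2),
// then the text that follows up to the next tag (group 3).
$pattern = '/(<\/?([\w+]+)[^>]*>)?([^<>]*)/';
$text = '<a href="/"><u>Linktext</u></a>';

preg_match_all($pattern, $text, $tags, PREG_SET_ORDER);

// $textParts is a hypothetical helper variable, not part of the original function.
$textParts = [];
foreach ($tags as $match) {
    // Group 3 is empty when a tag is followed directly by another tag.
    $part = $match[3] ?? '';
    if ($part !== '') {
        $textParts[] = $part;
    }
}

print_r($textParts);
```

Only the match for `<u>` carries text here, which is why the word counter treats "Linktext" as a normal word.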

Jan
labu77
  • So do you want all text **outside** of html tags? – Jan Feb 06 '16 at 08:43
  • @user2057781 Try this `(?<!\>)\b(<\/?([\w+]+)[^>]*>)?([^<>]*)\b(?!\<)` – tchelidze Feb 06 '16 at 08:52
  • @tchelidze this removes all the HTML – labu77 Feb 06 '16 at 08:59
  • Please provide some more input strings. – Jan Feb 06 '16 at 09:32
  • What do you mean by input strings? – labu77 Feb 06 '16 at 10:09
  • You can skip something by using [verbs (*SKIP)(*F)](http://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs): `>[^<]+<(*SKIP)(*F)|\w+`. [Try on regex101](https://regex101.com/r/xQ1fG6/1). – bobble bubble Feb 06 '16 at 11:22
  • @bobblebubble Thanks for your input. Sorry, I don't understand regex. Would you please add your code to my regex? Thank you – labu77 Feb 06 '16 at 11:25
  • Please provide better/more input samples in your question and what exactly you want to achieve. – bobble bubble Feb 06 '16 at 11:30
  • @user2057781, If I understood you correctly, with the following exemplary input text: `'Link textsome texthiddenanother text'` the final output should be: `'some textanother text'`? Right ? – RomanPerekhrest Feb 06 '16 at 11:32
  • @RomanPerekhrest No, that's not correct. Let me try again: I use a function to trim / truncate / cut a long string but keep the HTML. The function (from CakePHP) works great, but when a word is inside an HTML tag, it also gets trimmed. That means: some text Link text some text becomes some text Link te... but I need it like this: some text... Words inside HTML should be skipped while the HTML should be preserved (preserving the HTML already works). – labu77 Feb 06 '16 at 11:42
  • This question you refer to was removed, unfortunately. You should provide the complete explanation in the question here. – Armali Apr 07 '16 at 07:54

1 Answer

Use an XSL transformation. I used the template from a related answer (How to remove all text from an XML document):

$string = '<a href="/">Some text <u>Linktext</u> more text</a>';
$xslTemplate = '<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <!-- copy all nodes -->
  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- clear attributes -->
  <xsl:template match="@*">
    <xsl:attribute name="{name()}" />
  </xsl:template>
  <!-- ignore text content of nodes -->
  <xsl:template match="text()" />
</xsl:stylesheet>';

libxml_use_internal_errors(true);

$inputDom = new DOMDocument();
$inputDom->loadHTML($string);

$xslDom = new DOMDocument();
$xslDom->loadXML($xslTemplate);

$cp = new XSLTProcessor();
$cp->registerPHPFunctions();
$cp->importStylesheet($xslDom);

$transformedResult = $cp->transformToDoc($inputDom);
$transformedHtmlString = $transformedResult->saveXML($transformedResult->getElementsByTagName('body')->item(0));

// saveXML() keeps the <body> wrapper that loadHTML() created automatically, so strip it
$transformedHtmlString = str_replace(['<body>', '</body>'], '', $transformedHtmlString);
echo $transformedHtmlString;
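
If the XSL extension is not available, roughly the same effect can be sketched with DOM and XPath alone. This is a different technique from the stylesheet above, not a drop-in equivalent: it assumes the dom extension is enabled, and unlike the template it keeps attribute values instead of clearing them. The names `$dom`, `$xpath` and `$result` are illustrative only.

```php
<?php
$string = '<a href="/">Some text <u>Linktext</u> more text</a>';

// Suppress warnings that libxml may emit for fragment input.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($string);

// Select every text node and detach it, leaving only the element structure.
$xpath = new DOMXPath($dom);
foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}

// Serialize the children of the automatically created <body> element.
$result = '';
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}

echo $result;
```

Because the children of `<body>` are serialized one by one, no `str_replace()` cleanup of the wrapper is needed in this variant.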
Community
Aleksey Ratnikov
  • I don't want to use strip_tags. I need the HTML in the string. – labu77 Feb 06 '16 at 10:58
  • I see now, see corrected answer. Beware also of HEREDOC enclosed tag 'XML;' that should be part of code of course. – Aleksey Ratnikov Feb 06 '16 at 11:27
  • Maybe this sounds dumb to you, but I only have PHP files to work with the text. When I add this XML code to my PHP file I get syntax errors – labu77 Feb 06 '16 at 11:32
  • This is because of the missing HEREDOC closing tag I mentioned in the comment above. I've converted the HEREDOC to a plain string to avoid incorrect parser behavior. – Aleksey Ratnikov Feb 06 '16 at 11:47