Get words from string - skip html

Question

I use a function to get the first "x" words of a string. The main part is:

preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);

When a word is inside HTML, for example:

<a href="/"><u>Linktext</u></a>

The regex treats "Linktext" as a word. The regex should be changed to skip every word that is inside an HTML tag.

Is this possible?

@user2057781 Try this `(?<!\>)\b(<\/?([\w+]+)[^>]*>)?([^<>]*)\b(?!\<)` — tchelidze, Feb 06 '16 at 08:52
You can skip something by using [verbs (*SKIP)(*F)](http://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs): `>[^<]+<(*SKIP)(*F)|\w+`. [Try on regex101](https://regex101.com/r/xQ1fG6/1). — bobble bubble, Feb 06 '16 at 11:22
@bobblebubble Thanks for your input. Sorry, I don't understand regex. Would you please add your code to my regex? Thank you — labu77, Feb 06 '16 at 11:25
Please provide better/more input samples in your question and what exactly you want to achieve. — bobble bubble, Feb 06 '16 at 11:30
@user2057781, if I understood you correctly, with the following exemplary input text: `'Link textsome texthiddenanother text'` the final output should be: `'some textanother text'`? Right? — RomanPerekhrest, Feb 06 '16 at 11:32
@RomanPerekhrest No, that's not correct. I'll try again: I use a function to trim / truncate / cut a long string but keep the HTML. The function (from cakephp) works great, but when a word is inside an HTML tag, it also gets trimmed. That means: some text Link text some text becomes some text Link te... but I need it like this: some text... Words inside HTML should be skipped while the HTML itself should be preserved (preserving the HTML already works) — labu77, Feb 06 '16 at 11:42
This question you refer to was removed, unfortunately. You should provide the complete explanation in the question here. — Armali, Apr 07 '16 at 07:54
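
To illustrate the `(*SKIP)(*F)` suggestion from the comments, here is a minimal sketch that runs the commenter's pattern unchanged through `preg_match_all`. The sample string and the variable names are assumptions made for illustration only. It also shows a limitation: text directly between `>` and `<` is skipped, but tag names such as `u` still match `\w+`, so the pattern on its own is probably not a complete solution.

```php
<?php
// Pattern taken verbatim from the comment above. The first alternative consumes
// ">...<" and then forces a failure; (*SKIP) prevents a retry inside that span.
// The second alternative collects the remaining words.
$pattern = '/>[^<]+<(*SKIP)(*F)|\w+/';

// Hypothetical sample input, not taken from the original question.
$text = 'some text <u>Linktext</u>';

preg_match_all($pattern, $text, $matches);

// "Linktext" is skipped, but the tag name "u" is still matched twice.
print_r($matches[0]); // some, text, u, u
```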

score 0 · Answer 1 · edited May 23 '17 at 12:31

Use XSL transformations. I used the template from a related answer (How to remove all text from an XML document):

$string = '<a href="/">Some text <u>Linktext</u> more text</a>';
$xslTemplate = '<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <!-- copy all nodes -->
  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- clear attributes -->
  <xsl:template match="@*">
    <xsl:attribute name="{name()}" />
  </xsl:template>
  <!-- ignore text content of nodes -->
  <xsl:template match="text()" />
</xsl:stylesheet>';

libxml_use_internal_errors(true);

$inputDom = new DOMDocument();
$inputDom->loadHTML($string);

$xslDom = new DOMDocument();
$xslDom->loadXML($xslTemplate);

$cp = new XSLTProcessor();
$cp->registerPHPFunctions();
$cp->importStylesheet($xslDom);

$transformedResult = $cp->transformToDoc($inputDom);
$transformedHtmlString = $transformedResult->saveXML($transformedResult->getElementsByTagName('body')->item(0));

$transformedHtmlString = str_replace('<body>','', $transformedHtmlString); // saveXML() keeps the automatically created body tag, so strip it
$transformedHtmlString = str_replace('</body>','', $transformedHtmlString);
echo $transformedHtmlString;

I see now, see the corrected answer. Also note the HEREDOC closing tag 'XML;', which must of course be part of the code. — Aleksey Ratnikov, Feb 06 '16 at 11:27
Maybe this sounds dumb to you, but I only have PHP files to work with the text. When I add this XML code to my PHP file I get syntax errors — labu77, Feb 06 '16 at 11:32
This is because of the missing HEREDOC closing tag I mentioned in the comment above. I've converted the HEREDOC to a plain string to avoid incorrect parser behavior. — Aleksey Ratnikov, Feb 06 '16 at 11:47
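
As an alternative to rewriting the regex, the goal described in the question comments (counting only words that are not wrapped in any element) could be approached with `DOMDocument`. This is a sketch under stated assumptions, not the cakephp function and not the XSLT from the answer: the function name `wordsOutsideElements` and the wrapping `<div>` are hypothetical, and plain ASCII input is assumed (other encodings would need extra handling before `loadHTML`).

```php
<?php
// Hypothetical helper: returns the words found in text that is not nested
// inside any HTML element of the given fragment.
function wordsOutsideElements(string $html): array
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    // Wrap the fragment in a <div> (an assumption of this sketch) so that its
    // top-level nodes become the direct children of one known element.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $words = [];
    foreach ($wrapper->childNodes as $node) {
        // Only direct text children count; text inside <a>, <u>, ... is ignored.
        if ($node->nodeType === XML_TEXT_NODE) {
            preg_match_all('/\w+/u', $node->nodeValue, $found);
            $words = array_merge($words, $found[0]);
        }
    }
    return $words;
}

$words = wordsOutsideElements('some text <a href="/"><u>Linktext</u></a> more text');
print_r(array_slice($words, 0, 3)); // some, text, more
```

Taking the first "x" words is then a plain `array_slice` on the result; the original markup is left untouched because the function only reads the parsed tree.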
