Wrap sequences of Latin characters in a span tag

Question

I need a pattern for preg_replace that wraps every sequence of Latin characters and digits in an HTML page in the tag <span class="text-arial"></span>. For example, the following HTML fragment

<a href="http://domain.com/path" target="_blank">GSPd 役に立つツール： スキル意欲マトリクス</a>

should be replaced with:

<a href="http://domain.com/path" target="_blank"><span class="text-arial">GSPd</span> 役に立つツール： スキル意欲マトリクス</a>

Obviously, only the text inside nodes should be processed this way, so that the replacement doesn't break the HTML tags themselves.

What I've tried:

$p = '#(?<=\>)([a-zA-Z0-9]+)(?=\<)#ium';
$html = preg_replace(
    $p,
    '><span class="text-arial">$0</span><',
    $html
);

This pattern needs to be extended to handle content that mixes Latin and non-Latin characters, e.g. GSPd 役に立つツール：スキル意欲マトリクス 100

I need 1 billion USD, but nobody gives me. :( Now to be constructive: what have you tried besides posting question on SO and how that fails? — Leri, Jun 09 '14 at 10:23
Glad the solution helps. :) For the full story I recommend you have a look at [the linked question](http://stackoverflow.com/questions/23589174/match-or-replace-a-pattern-except-in-situations-s1-s2-s3-etc/23589204#23589204), or save it for later, I had a lot of fun writing the answer. :) — zx81, Jun 09 '14 at 12:06

score 2 · Accepted Answer · edited May 23 '17 at 12:20

To match letters and digits while skipping text inside a <tag>, you can use the lovely (*SKIP)(*F) technique (available in Perl and PCRE) and be done without really breaking a sweat:

(?i)<[^>]*>(*SKIP)(*F)|[a-z][a-z ]+

On the demo, look at the Substitution section.

You can pop that into your preg_replace:

$regex = "~(?i)<[^>]*>(*SKIP)(*F)|[a-z][a-z ]+~";
$replace = '<span class="text-arial">\0</span>';
$replaced = preg_replace($regex, $replace, $original);

How does it work?

This is a situation where you want to exclude some content from being matched, in this case tags. It is similar to this question about "regex-matching a pattern unless...".

The left side of the alternation | matches complete < ... > tags, then deliberately fails, and (*SKIP) makes the engine resume matching right after the tag instead of retrying inside it. The right side matches "Latin text" (defined here as letters and spaces, which can be refined), and we know it is the right text because it was not matched by the expression on the left.
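As a rough illustration of the two sides of the alternation, the sketch below runs the pattern from above over the sample link from the question and lists what actually gets matched. The variable names are made up for this example.

```php
<?php
// Pattern from the answer: tags are matched on the left and discarded
// via (*SKIP)(*F); only the right-hand side can produce a match.
$regex = '~(?i)<[^>]*>(*SKIP)(*F)|[a-z][a-z ]+~';

// Sample markup taken from the question.
$original = '<a href="http://domain.com/path" target="_blank">GSPd 役に立つツール： スキル意欲マトリクス</a>';

preg_match_all($regex, $original, $matches);

// Nothing from inside <a ...> or </a> (href, target, the URL) shows up here.
var_export($matches[0]);
```

With this input the only match should be "GSPd " (trailing space included, because the character class allows spaces). The attribute names and the URL are never reported, since they sit inside text consumed by the left-hand side.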

Further refinements

You can explore the [a-z][a-z ]+ and refine it till you are satisfied that it corresponds to your definition of "latin text".
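One possible refinement, sketched under the assumption that "Latin text" means runs of ASCII letters and digits separated by single spaces: digits are included, one-character runs are allowed, and leading or trailing spaces stay outside the span. The sample string and the class name text-arial come from the question; the variable names are only illustrative.

```php
<?php
// Left side: skip whole tags. Right side: one or more alphanumeric "words",
// joined by single spaces, so the match never starts or ends with a space.
// The u modifier assumes the input is valid UTF-8 (preg_replace returns
// null otherwise).
$regex = '~<[^>]*>(*SKIP)(*F)|[a-z0-9]+(?: [a-z0-9]+)*~iu';
$replace = '<span class="text-arial">$0</span>';

// Mixed-content example from the question, placed inside the sample link.
$original = '<a href="http://domain.com/path" target="_blank">GSPd 役に立つツール：スキル意欲マトリクス 100</a>';

$replaced = preg_replace($regex, $replace, $original);
echo $replaced, PHP_EOL;
```

Here both "GSPd" and "100" should end up in their own span, while the Japanese text and the spaces next to it are left alone. Punctuation, entities such as &amp;, and accented letters are not covered by this sketch and would need their own additions to the character class.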

Many thanks for the answer - your approach works fine. Just one more question - what is the best way to exclude some tags from matching, such as — Igor Evstratov, Jun 09 '14 at 12:04

score 1 · Answer 2 · answered Jun 10 '14 at 09:26

I've adjusted zx81's approach a bit to avoid processing text of some tags like style or script:

    $regex = "~(?i)<(head|style|script|noscript)[^>]*?>.*?<\/.*?\\1>(*SKIP)(*F)|<[^>]*>(*SKIP)(*F)|[a-z0-9&][_a-z0-9&,.;:#%\-/\(\) ]*~smu";
    $replace = '<span class="text-arial">\0</span>';
    $html = preg_replace($regex, $replace, $html);
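To see the effect of the extra alternative, the following sketch feeds the pattern a small fragment containing a style block. The sample HTML is hypothetical; the pattern and replacement are copied from the snippet above.

```php
<?php
// Pattern and replacement as in the answer above.
$regex = "~(?i)<(head|style|script|noscript)[^>]*?>.*?<\/.*?\\1>(*SKIP)(*F)|<[^>]*>(*SKIP)(*F)|[a-z0-9&][_a-z0-9&,.;:#%\-/\(\) ]*~smu";
$replace = '<span class="text-arial">\0</span>';

// Hypothetical input: a style block followed by mixed Latin/Japanese text.
$html = '<style>p { color: red; }</style><p>ABC 日本語</p>';

// The first alternative consumes <style>...</style> as one unit and discards
// it, so the CSS inside is never offered to the "Latin text" alternative.
$result = preg_replace($regex, $replace, $html);
echo $result, PHP_EOL;
```

The CSS rule should come out untouched, and only "ABC " in the paragraph gets wrapped. Note that the trailing space lands inside the span, because the character class in this pattern includes a space.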

What it was needed for

The client asked for MS P Gothic for Japanese characters and Arial for Latin ones. Because MS P Gothic already includes Latin glyphs, the Latin characters have to be wrapped in a tag so that font-family: Arial can be applied to them via CSS. Adding spans manually is annoying, so many thanks to @zx81 for a good solution!