Regular expression to split words with accented characters from latin

Question

I'm working on an HTML tool for studying ancient Latin. In one exercise the student has to click on individual words inside a div that contains a passage of Latin:

<div class="clickable">
                   Cum a Romanis copiis vincĭtur măr, Gallia terra fera est. 
Regionis incŏlae terram non colunt, autem sagittis feras necant et postea eas vorant. 
Etiam a_femĭnis vita agrestis agĭtur, 
miseras vestes induunt et cum familiā in parvis casis vivunt. 
Vita secūra nimiaeque divitiae a Gallis contemnuntur. 
Gallorum civitates acrĭter pugnant et ab inimicis copiis timentur. 
Galli densis silvis defenduntur, tamen Roma feram Galliam capit. 
</div>

In my JavaScript I wrap every single word in a <span> with a regex, and then attach some actions.

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // Wrap each run of \w characters in a span
    var myText = oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');

    return myText;
}).click(function(event) {
    // Ignore clicks that do not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

The problem is that words containing ĭ, ŏ or ā are not wrapped correctly: they are split at these characters.

How can I match this class of words correctly?

Try using [XRegExp](https://cdnjs.cloudflare.com/ajax/libs/xregexp/2.0.0/xregexp-all-min.js) — Wiktor Stribiżew, Apr 04 '16 at 07:10
See [this answer](http://stackoverflow.com/a/280762/160386) for more suggestions. — Jan Tojnar, Apr 04 '16 at 07:14

Slavik · Accepted Answer · 2016-04-04T08:01:25.767

4

You can split your text on dividers. In the common case a divider is whitespace or a punctuation mark:

(.+?)([\s,.!?;:)([\]]+)

https://regex101.com/r/xW4pF1/5

Edit

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // $1 is the word, $2 the run of dividers that follows it
    var myText = oldHtml.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');

    return myText;
}).click(function(event) {
    // Ignore clicks that do not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

https://jsfiddle.net/s568c0pp/3/
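
A variation on the same idea, as a minimal sketch: instead of capturing a word plus its trailing dividers, match each run of characters that are not dividers. The helper name `wrapWords` and the sample sentence handling are assumptions for illustration, not part of the answer above, and the sketch assumes the div holds plain text with no nested markup.

```javascript
// Dividers: whitespace plus the punctuation listed in the answer's regex.
// Everything else, including letters such as ĭ, ŏ, ā, counts as part of a word.
var WORD = /[^\s,.!?;:()\[\]]+/g;

// Hypothetical helper: wraps every non-divider run in a span.
// $& in the replacement string stands for the whole matched text.
function wrapWords(text) {
    return text.replace(WORD, '<span class="word">$&</span>');
}

var sample = 'Cum a Romanis copiis vincĭtur măr, Gallia terra fera est.';
console.log(wrapWords(sample));
```

Because only the words are matched, the dividers stay in place untouched and no leading whitespace ends up inside a span. Inside the jQuery callback this would be used as `return wrapWords(oldHtml);`.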

edited Apr 04 '16 at 08:01

answered Apr 04 '16 at 07:09

This approach works fine, but it also matches the dividers. How can I exclude punctuation from the result? – cesare Apr 04 '16 at 07:19
It matches two groups: the word and the divider. So use both in the replacement string of your replace call: `oldHtml.replace(/(.+?)([\s\,\.\!\?]+)/g, '<span class="word">$1</span>$2')` https://jsfiddle.net/s568c0pp/2/ – Slavik Apr 04 '16 at 07:24

Sergey Moukavoztchik · Answer 2 · 2016-04-04T07:22:32.700

1

The \w metacharacter matches only ASCII word characters: a-z, A-Z, 0-9 and _ (underscore). Accented letters such as ĭ are not included, so \b sees a word boundary next to them. You therefore need to change your regex to use a range of Unicode characters instead of \w.

You can also try \p{L} instead of \w to match any Unicode letter.

See also: http://www.regular-expressions.info/unicode.html
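
As a hedged sketch of the \p{L} idea: engines that implement ES2018 accept Unicode property escapes, but only when the regex carries the `u` flag; without the flag, `\p` is not treated as a property escape. The names `LATIN_WORD` and `wrapLatinWords` are made up for this example, and keeping `_` in the class is an assumption made to mirror what \w does.

```javascript
// ASCII-only \w: ĭ (U+012D) is not a word character, so the word is cut around it.
var ascii = 'vinc\u012dtur'.match(/\b\w+?\b/g);
console.log(ascii); // [ 'vinc', 'tur' ]

// Unicode-aware pattern: \p{L} is any letter, \p{M} any combining mark
// (for text that stores the breve or macron as a separate code point).
// The underscore is kept so that tokens like a_femĭnis stay in one piece.
// Needs the `u` flag and an ES2018-capable engine.
var LATIN_WORD = /[\p{L}\p{M}_]+/gu;

// Hypothetical helper: wraps every matched word in a span.
function wrapLatinWords(text) {
    return text.replace(LATIN_WORD, '<span class="word">$&</span>');
}

console.log(wrapLatinWords('Etiam a_femĭnis vita agrestis agĭtur,'));
```

No \b is needed here: the `+` quantifier is greedy, so each match already spans a whole run of letters.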

edited Apr 04 '16 at 07:22

answered Apr 04 '16 at 07:08

I have tried /\b(\p{L}+?)\b/g, but it doesn't match any word. – cesare Apr 04 '16 at 07:17
Sorry, the JavaScript regex engine is a little bit different. Give me a couple of minutes... I'll check the alternative. – Sergey Moukavoztchik Apr 04 '16 at 07:29
