1

I'm working on an HTML tool for studying Latin. In one exercise the student has to click on individual words inside a div that contains a Latin passage:

<div class="clickable">
                   Cum a Romanis copiis vincĭtur măr, Gallia terra fera est. 
Regionis incŏlae terram non colunt, autem sagittis feras necant et postea eas vorant. 
Etiam a_femĭnis vita agrestis agĭtur, 
miseras vestes induunt et cum familiā in parvis casis vivunt. 
Vita secūra nimiaeque divitiae a Gallis contemnuntur. 
Gallorum civitates acrĭter pugnant et ab inimicis copiis timentur. 
Galli densis silvis defenduntur, tamen Roma feram Galliam capit. 
</div>    

In my JavaScript I wrap every word in a <span> using a regex and then attach a click handler:

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // Wrap each word in a span (this is the regex that splits accented words)
    return oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');
}).click(function(event) {
    // Ignore clicks that did not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

The problem is that words containing ĭ, ŏ or ā are not wrapped correctly: they are split at these characters.

How can I match this class of words correctly?

JS Fiddle

cesare

2 Answers

4

Instead of matching the words, you can split the text on the dividers between them. In the common case a divider is whitespace or a punctuation mark:

(.+?)([\s,.!?;:)([\]]+)

https://regex101.com/r/xW4pF1/5
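
The divider pattern above can be tried without a DOM. The following is only a minimal sketch: `wrapWords` is a hypothetical helper name, and the sample string is a fragment of the passage from the question. Group 1 is the word and group 2 is the run of dividers that follows it, which is put back outside the span.

```javascript
// Hypothetical helper (not part of the original answer): wraps every run of
// non-divider characters in a span and keeps the dividers outside it.
function wrapWords(text) {
  // Group 1: the word (lazy, at least one character, no line breaks).
  // Group 2: one or more dividers (whitespace or punctuation).
  return text.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');
}

const sample = "vinc\u012Dtur m\u0103r, Gallia."; // "vincĭtur măr, Gallia."
console.log(wrapWords(sample));
// <span class="word">vincĭtur</span> <span class="word">măr</span>, <span class="word">Gallia</span>.
```

Because the pattern requires a divider after each word, a final word with nothing after it is left unwrapped. The passage in the question ends with a full stop, so that case does not arise there.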

Edit

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // $1 is the word, $2 is the divider that follows it (kept outside the span)
    return oldHtml.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');
}).click(function(event) {
    // Ignore clicks that did not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

https://jsfiddle.net/s568c0pp/3/

Slavik
  • This approach works fine, but it also matches the dividers. How can I exclude punctuation from the result? – cesare Apr 04 '16 at 07:19
  • It matches two groups: the word and the divider. Use both in the replacement, with only the word inside the span: `oldHtml.replace(/(.+?)([\s\,\.\!\?]+)/g, '<span class="word">$1</span>$2')` https://jsfiddle.net/s568c0pp/2/ – Slavik Apr 04 '16 at 07:24
1

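The \w metacharacter matches only ASCII word characters: a-z, A-Z, 0-9 and _ (underscore). Characters such as ĭ, ŏ and ā are outside that set, so \b treats the position next to them as a word boundary and the word is split there. You need to change your regex to use an explicit range of Unicode characters instead of \w.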
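
A minimal sketch of the range approach follows. The range is an assumption rather than something the answer specifies: it covers the letters of the Latin-1 Supplement, Latin Extended-A and Latin Extended-B blocks (U+00C0 to U+024F, skipping × and ÷), and it assumes the accented letters are stored as single precomposed code points. `LATIN_WORD` and `wrapLatinWords` are hypothetical names.

```javascript
const sample = "vinc\u012Dtur"; // "vincĭtur" with precomposed ĭ (U+012D)

// \w is ASCII-only, so the match breaks at ĭ.
console.log(sample.match(/\w+/g)); // [ 'vinc', 'tur' ]

// Assumed range: ASCII letters plus accented Latin letters up to U+024F.
// Unlike \w it excludes digits and the underscore.
const LATIN_WORD = /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+/g;

// Hypothetical helper: wraps each matched word; $& is the whole match.
function wrapLatinWords(text) {
  return text.replace(LATIN_WORD, '<span class="word">$&</span>');
}

console.log(wrapLatinWords("vinc\u012Dtur m\u0103r,"));
// <span class="word">vincĭtur</span> <span class="word">măr</span>,
```

This sketch is meant for plain text. Run on markup that contains tags, it would also wrap tag and attribute names, so it suits a div like the one in the question that holds only text.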

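You can also try \p{L} instead of \w; it matches any Unicode letter (not any character). In JavaScript, \p{L} is only recognized when the regex has the u flag (ES2018 or later); older engines need a library such as XRegExp.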
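
As a sketch of the \p{L} variant, assuming an engine with ES2018 Unicode property escapes: the character class below also includes \p{M} (combining marks), which is an addition beyond the answer so that letters stored in decomposed form (for example i followed by U+0306) stay inside one word. `UNICODE_WORD` and `wrapUnicodeWords` are hypothetical names.

```javascript
// Letters plus combining marks; the u flag is required for \p{...}.
const UNICODE_WORD = /[\p{L}\p{M}]+/gu;

// Hypothetical helper: wraps each run of letters in a span; $& is the whole match.
function wrapUnicodeWords(text) {
  return text.replace(UNICODE_WORD, '<span class="word">$&</span>');
}

console.log(wrapUnicodeWords("a fem\u012Dnis,")); // precomposed ĭ
// <span class="word">a</span> <span class="word">femĭnis</span>,

console.log(wrapUnicodeWords("agi\u0306tur")); // i + combining breve
// the whole word ends up in a single span
```

As with any pattern that matches letters, this is intended for text without markup, since tag names are made of letters too.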

See also: http://www.regular-expressions.info/unicode.html