
I have made a very simplified translation tool similar to Google Translate. The idea is to build this simple tool for a minority language in Sweden called "jamska". The app is built around a function that takes the string from a textarea with the ID #svenska and replaces words in the string using RegExp.

I've made an array called arr that the function's for loop uses as a dictionary. It looks like this: var arr = [["eldröd", "eillrau"], ["oväder", "over"] ...]. The first word in each array item is in Swedish, and the second word is in jamska. If the RegExp finds a matching word, the loop replaces that word using this code:

function translate() {

var str = $("#svenska").val();
var newStr = "";
for (var i = 0; i < arr.length; i++) {
    var replace = arr[i][0];
    var replaceWith = arr[i][1];
    var re = new RegExp('(^|[^a-z0-9åäö])' + replace + '([^a-z0-9åäö]|$)', 'ig');
    str = str.replace(re, "$1" + replaceWith + '$2');
}

$("#jamska").val(str);

}

translate() is then called from a keyup event handler on the #svenska textarea, like this: $("#svenska").keyup(function() { translate(); });

The translated string is then assigned as the value of another textarea with the ID #jamska. So far, so good.

I have a problem though: if the translated word in jamska is also a word in Swedish, the function translates that word too. This happens because I keep reassigning str to the translated version of itself, using: str = str.replace(re, "$1" + replaceWith + '$2');. Every pass of the loop therefore works on the output of the previous pass.

Example: The Swedish word "brydd" is "fel" in jamska. "Fel" is also a word in Swedish, so the result of the translation is "felht", since the Swedish word "fel" is "felht" in jamska.
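
The chained replacement can be reproduced in isolation. The following is only a minimal sketch: the two-entry arr is made up from the example above, and translateSequentially is a hypothetical name for the same loop with the DOM access taken out.

```javascript
// Hypothetical two-entry dictionary, taken from the example above.
var arr = [["brydd", "fel"], ["fel", "felht"]];

// Same loop as in translate(), but working on a plain string argument.
function translateSequentially(str) {
    for (var i = 0; i < arr.length; i++) {
        var re = new RegExp('(^|[^a-z0-9åäö])' + arr[i][0] + '([^a-z0-9åäö]|$)', 'ig');
        // Each pass reads the output of the previous pass.
        str = str.replace(re, '$1' + arr[i][1] + '$2');
    }
    return str;
}

var result = translateSequentially("brydd");
console.log(result); // "felht": "brydd" became "fel", which was then translated again
```

Whether a word is translated twice depends on the order of the entries in arr, which makes the problem easy to miss.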

Does anyone have any idea for how to work around this problem?

tobiasg
  • Restating your question and leaving out all specifics such as jamska, jquery and references to your DOM would make it more useful to future readers and help you attract better answers faster. – le_m Mar 28 '17 at 11:04
  • You *might* find your answer here: http://stackoverflow.com/questions/15604140/replace-multiple-strings-with-multiple-other-strings – le_m Mar 28 '17 at 11:09
  • @le_m You're right! Will think about that in the future. – tobiasg Mar 28 '17 at 11:31

1 Answer

Instead of looking for each dictionary word in the input and replacing it with its translation, I would recommend finding every word ([a-z0-9åäö]+) in your text and replacing it with its translation if one is found in the dictionary, or with itself otherwise. Each word of the input is then translated at most once, so a translation can never be translated again:

//var arr = [["eldröd", "eillrau"], ["oväder", "over"] ...]
// An object keyed by the lowercase source word works better here than an array of pairs.
var dict = {
    eldröd: "oväder",
    eillrau: "over"
    // ...
};
var str = "eldröd test eillrau eillrau oväder over";
var translated = str.replace(/[a-z0-9åäö]+/ig, function(m) {
    // Keys are lowercase; the own-property check avoids inherited names such as "constructor".
    var key = m.toLowerCase();
    // Return the translation if there is one, otherwise the matched word unchanged.
    return Object.prototype.hasOwnProperty.call(dict, key) ? dict[key] : m;
});
console.log(translated);
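
If the dictionary already exists as an array of pairs, as in the question, it does not have to be retyped. Here is a rough sketch of converting it, assuming the source words are unique once lowercased; buildDict is a hypothetical helper name and the two entries are taken from the question's example.

```javascript
// Entries from the question's example: [Swedish, jamska].
var arr = [["brydd", "fel"], ["fel", "felht"]];

// Hypothetical helper: turn the array of pairs into a lookup object.
function buildDict(pairs) {
    // Object.create(null) has no inherited keys, so words like "constructor" are safe.
    var dict = Object.create(null);
    pairs.forEach(function(pair) {
        dict[pair[0].toLowerCase()] = pair[1];
    });
    return dict;
}

var dict = buildDict(arr);

// Single pass over the input: every word is looked up exactly once.
function translate(str) {
    return str.replace(/[a-z0-9åäö]+/ig, function(m) {
        var trans = dict[m.toLowerCase()];
        return trans === undefined ? m : trans;
    });
}

var out = translate("brydd fel");
console.log(out); // "fel felht"
```

Here "brydd" becomes "fel" and stays that way, because the replacement text is never scanned again.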

Update:

If dictionary keys may be phrases (i.e. strings containing spaces), the regex should be extended to list all of these phrases explicitly. The final regex would then look like

(?:phrase 1|phrase 2|etc...)(?![a-z0-9åäö])|[a-z0-9åäö]+

It tries to match one of the phrases first and falls back to single words only if no phrase matches. The negative lookahead (?![a-z0-9åäö]) rejects a phrase that is immediately followed by a letter or digit (e.g. varken bättre eller sämreåäö).

Phrases immediately preceded by letters are filtered out implicitly: a match is either the first one (so nothing precedes it) or it follows an earlier match, and because [a-z0-9åäö]+ is greedy, that earlier match has already consumed every adjacent letter. The current match must therefore start after at least one non-word character such as a space.
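
The two boundary cases can be checked on a tiny input. This is just an illustrative sketch, assuming "test test" is the only phrase key; matches is a hypothetical helper.

```javascript
// Assumption: "test test" is the only phrase key in the dictionary.
var re = /(?:test test)(?![a-z0-9åäö])|[a-z0-9åäö]+/ig;

// Hypothetical helper: list everything the regex matches in a string.
function matches(s) {
    return s.match(re) || [];
}

var whole = matches("test test");     // ["test test"]: the phrase matches as one unit
var followed = matches("test testå"); // ["test", "testå"]: lookahead rejects the phrase
var preceded = matches("åtest test"); // ["åtest", "test"]: "åtest" is consumed first
console.log(whole, followed, preceded);
```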

//var arr = [["eldröd", "eillrau"], ["oväder", "over"] ...]
// An object keyed by the lowercase source word or phrase works better than an array of pairs.
var dict = {
    eldröd: "oväder",
    eillrau: "over",
    bättre: "better",
    "varken bättre eller sämre": "vär å int viller",
    "test test": "double test"
    // ...
};

var str = "eldröd test eillrau eillrau oväder over test test ";
str += "varken bättre eller sämre ";
str += "don't trans: varken bättre eller sämreåäö ";
str += "don't trans again: åäövarken bättre eller sämre";

// Collect the multi-word keys, longest first, and escape regex metacharacters in them.
var phrases = Object.keys(dict)
    .filter(function(k) { return /\s/.test(k); })
    .sort(function(a, b) { return b.length - a.length; })
    .map(function(k) { return k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); })
    .join('|');
var re = new RegExp('(?:' + phrases + ')(?![a-z0-9åäö])|[a-z0-9åäö]+', 'ig');

var translated = str.replace(re, function(m) {
    // Keys are lowercase; the own-property check avoids inherited names such as "constructor".
    var key = m.toLowerCase();
    // Return the translation if there is one, otherwise the matched text unchanged.
    return Object.prototype.hasOwnProperty.call(dict, key) ? dict[key] : m;
});
console.log(translated);
Dmitry Egorov
  • Just tried out your code and it seems to work as expected. I'll try it out a bit and return! Big thanks! – tobiasg Mar 28 '17 at 11:23
  • @WiktorStribiżew: sorry, I don't quite get your point. Do you mean those messy `0-9` in the character class? They were inherited from the OP sample so I took them as an implicit requirement (though not quite clear to me). – Dmitry Egorov Mar 28 '17 at 11:26
  • @DmitryEgorov: Never mind, I am not sure then why the custom boundaries were used in OP regex at all then. – Wiktor Stribiżew Mar 28 '17 at 11:26
  • @WiktorStribiżew I use the custom boundaries because both Swedish and jamska use åäö. Regex seems to treat åäö as special characters, so I ran into problems when trying to translate words beginning or ending with åäö. – tobiasg Mar 28 '17 at 11:29
  • @DmitryEgorov In the code sample I provided, the array was sorted by length in descending order so that longer phrases were prioritized over shorter ones. For example, the sentence "varken bättre eller sämre" should be translated to "vär å int viller", as the array item was set up like: ["varken bättre eller sämre", "vär å int viller"]. With your code, the sentence gets translated to "varken likar ell näppar", since there are entries such as {bättre: "likar"} etc. I'm aware that I should have mentioned this in my question, so I don't expect a solution from you :) But maybe someone knows? – tobiasg Mar 28 '17 at 11:47
  • Ah, sorry. I think this might be fixed by just putting quotation marks around the words having whitespaces, right? Like: {"varken bättre eller sämre": "vär å int viller"}? – tobiasg Mar 28 '17 at 11:52
  • @tobiasg: So the dictionary may contain phrases (i.e. series of words) too? Well, I need to adjust the answer if that is the case. – Dmitry Egorov Mar 28 '17 at 11:53
  • @tobiasg: I'm afraid the quotation marks won't help in this case. The solution was designed on the assumption that the dictionary contains no spaces. I'll update the answer to take this into account. – Dmitry Egorov Mar 28 '17 at 11:55
  • @DmitryEgorov Thank you so much! – tobiasg Mar 28 '17 at 12:38
  • @DmitryEgorov Just tried out the example and it worked perfectly. You da man! Thanks once again! – tobiasg Mar 28 '17 at 12:57