1

My web application supports two languages, English and Arabic.

The application has a search box. When a user searches by a name or part of a name, the query retrieves matching users from the database, restricted to those who share the searcher's "Hometown".

Explanation:

For example, if a user whose hometown is "California" searches for the name "Victor", the query first selects the people whose hometown is also "California" and then searches that list for "Victor". It returns the users who have "California" as their hometown and "Victor" as their name or as part of their name.

The problem is that this only works when the hometown "California" is saved in English. In Arabic, "California" is saved as "كاليفورنيا". In that case the hometown comparison fails and no values are retrieved.

I want my query to recognize that both values are the same hometown and retrieve the results. Is this possible?

What alternative should I consider for this comparison logic? I am confused. Any suggestions, please?

EDIT: *I have an idea: once the hometown is known, is it possible to use Google's translator or transliterator to convert it to the other language (to Arabic if it is in English, or to English if it is in Arabic) and return the combined search results for both? Any suggestions?*

Ken Bloom
IamIronMAN
  • Are you 100% sure the `userId`, `homeTown` etc. are okay? Constructing JDBC queries like that is often the cause of SQL injection attacks. To avoid that, use a `PreparedStatement` – Martin Algesten Nov 26 '10 at 19:15
  • @martin Yes, they are OK. I am able to retrieve values individually in English and in Arabic. My problem is that when I search for a user whose hometown is in Arabic, I am not able to retrieve values. I want to compare the values before the search query so that I can match and retrieve them. – IamIronMAN Nov 26 '10 at 19:26
  • What I mean is that a "malicious user" could, instead of typing in a `userId` in your web page, perhaps write some SQL that may be executed. Example: `; drop database mysql;`. String concatenation with user input data in SQL is generally a bad idea. – Martin Algesten Nov 26 '10 at 19:32
  • Also, search Google scholar for papers about **cross-language name search** – Ken Bloom Nov 26 '10 at 19:37
  • 1
    You may want to consider a hybrid of several different solutions, if you can (I'd probably subject each query to a lookup table, (double)-metaphone similarity, and maybe edit distance similarity in that order). Also remember, that maintaining a natural language processing solution can require care and feeding e.g. keeping data sets up to date, tuning based on user data, so keep a log of the queries that people throw at you and the results you return, and remember to tune things to see if you can get better performance. (E.g. you can see what errors users made and add them to the lookup table.) – Ken Bloom Nov 28 '10 at 00:45

6 Answers

6

The problem you encounter is that you need information in two or more languages, and you want the users of your application to be able to use any of them. One possible approach is to keep multiple records per item and include a language code as part of the primary key. For instance, if your record is

id   hometown   name
001  California Victor

you could introduce a language code and store

id   lang hometown   name
001  en   California Victor
001  ar   كاليفورنيا Victor

Then your search would match either "California" or "كاليفورنيا", giving you the id 001, which you can use to load all translations of your data (or just the data in the current output language). This scheme can be used with any number of languages and has the added advantage that you don't need to prefill the table: you can add new translations for records when they become known.

(Caveat: I just repeated your Arabic string; I can't read it. 'ar' is the ISO 639-1 language code for Arabic.)
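
The scheme above can be sketched in Java roughly as follows. This is only an in-memory illustration: `UserRecord` and `UserTable` are hypothetical names, and a plain list stands in for the database table. The Arabic spelling from the question is written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical row type mirroring the table above: one row per (id, lang) pair.
final class UserRecord {
    final String id;
    final String lang;
    final String hometown;
    final String name;

    UserRecord(String id, String lang, String hometown, String name) {
        this.id = id;
        this.lang = lang;
        this.hometown = hometown;
        this.name = name;
    }
}

// Hypothetical in-memory stand-in for the database table.
final class UserTable {
    private final List<UserRecord> rows = new ArrayList<>();

    void add(UserRecord row) {
        rows.add(row);
    }

    // Ids of all records whose hometown equals the search term in any stored language.
    Set<String> findIdsByHometown(String hometown) {
        Set<String> ids = new LinkedHashSet<>();
        for (UserRecord row : rows) {
            if (row.hometown.equals(hometown)) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    // All stored translations of one record.
    List<UserRecord> translationsOf(String id) {
        List<UserRecord> result = new ArrayList<>();
        for (UserRecord row : rows) {
            if (row.id.equals(id)) {
                result.add(row);
            }
        }
        return result;
    }
}
```

After adding the two rows for id 001, `findIdsByHometown` returns the same id for the English spelling and for the Arabic one, and `translationsOf("001")` returns both rows. Records that have no Arabic row yet are simply found by their English hometown only.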

rsp
  • Thanks, but there are thousands of users, each with a different hometown. I don't know the Arabic translation of every hometown a user specifies, and I can't add Arabic names for thousands of people's hometowns in MySQL. If I am wrong, please correct me. – IamIronMAN Nov 26 '10 at 18:46
  • You can create said table, it will just require a lot of work either by you or somebody else whose work you can leverage upon. It is not a problem getting it in MySQL as such. – Thorbjørn Ravn Andersen Nov 26 '10 at 19:32
  • As Thorbjørn said, someone needs to provide the information in the extra language. If your user interface allows your users to add information in an additional language they can provide the data to search for. This setup does not make the additional language mandatory, if it is present you can use it in a search, if it is not it just works with the english information. – rsp Nov 26 '10 at 19:54
  • In Wikipedia, there are links between articles about the same topic in different languages. It would be a "small matter of programming" to use a Wikipedia dump to take all of the place names on Wikipedia and create a multilingual lookup table. – Ken Bloom Nov 28 '10 at 00:07
3

Does the Arabic sound like "California"? If so, you will need to compare on a "sounds-like" basis, which most likely means converting both names to phonemes first.
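
To give a feel for what a "sounds-like" comparison means, here is a deliberately crude Java sketch. It is not Metaphone or any other standard algorithm: `CrudePhoneticKey` is a hypothetical helper that assumes the Arabic name has already been transliterated into Latin letters by some other step, and it simply merges c/k/q, drops vowels and weak consonants, and collapses repeated letters.

```java
import java.util.Locale;

// Hypothetical, simplified phonetic key; a stand-in for a real algorithm such as Metaphone.
final class CrudePhoneticKey {

    // Builds a rough consonant skeleton from text written in Latin letters.
    static String of(String romanized) {
        String s = romanized.toLowerCase(Locale.ROOT);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 'c' || c == 'q') {
                c = 'k'; // treat c, k and q as one sound class
            }
            if (c < 'a' || c > 'z') {
                continue; // skip spaces, digits and punctuation
            }
            if ("aeiouywh".indexOf(c) >= 0) {
                continue; // drop vowels and weak consonants
            }
            if (key.length() > 0 && key.charAt(key.length() - 1) == c) {
                continue; // collapse doubled letters
            }
            key.append(c);
        }
        return key.toString();
    }

    // Two names "sound alike" here when their keys are identical.
    static boolean soundsAlike(String a, String b) {
        return of(a).equals(of(b));
    }
}
```

With this key, "California" and an assumed romanization such as "Kalifurniya" both reduce to the same skeleton, while "Texas" does not. A key this coarse will also produce false matches, and it cannot help when a place has unrelated names in the two languages.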

Thorbjørn Ravn Andersen
  • @Thorb What are "sounds-like" and "phoneme conversion"? This is the first time I am hearing of them. Since the hometown is the name of a place, I think the sounds will most probably be the same. Is it implementable? – IamIronMAN Nov 26 '10 at 18:29
  • 1
    As a concrete suggestion, you could try transliterating the Arabic into Roman letters, and then using Metaphone or Double Metaphone to look for matches (these are standard algorithms - look them up). There was some discussion around this idea here - http://stackoverflow.com/questions/1419882/enabling-soundex-metaphone-for-non-english-characters - where the accepted solution was to try Lucene, which has some support for cross-language search. You're using Java, so that would fit in your technology stack. – Tom Anderson Nov 26 '10 at 19:07
  • Phoneme conversion isn't always perfect. For example, Jerusalem is pronounced Yerushalayim in Hebrew. The two have different codes in Metaphone and Double Metaphone. And the two names have no phonetic relation at all if you take the Arabic name for the city "al-Quds al-Sharif". Nevertheless, I second this suggestion -- it's the best choice. – Ken Bloom Nov 26 '10 at 19:15
  • @Ken, I am aware of that. This was just the approach I would expect to require the least amount of data gathering in terms of lookup tables. – Thorbjørn Ravn Andersen Nov 26 '10 at 19:33
  • @Thorbjørn: I'm willing to bet that Leo-vin wasn't aware of that, so that's why I pointed it out. – Ken Bloom Nov 26 '10 at 19:42
  • @Thob @Klen Oh my God, betting on me? Sirs, I am putting in my effort. Sorry for not knowing, but I am not doing it deliberately. The fact is I am trying to learn and implement. – IamIronMAN Nov 26 '10 at 19:50
  • @Leo-vin, it's an expression -- no actual money changing hands here. – Ken Bloom Nov 26 '10 at 20:02
  • @Leo-vin, no need to Sir me (and probably others). We are essentially peers here. – Thorbjørn Ravn Andersen Nov 26 '10 at 20:27
  • @Thorb Thank you. Please see my edit and see whether you can help me out? – IamIronMAN Nov 26 '10 at 20:30
  • @Leo-vin, sorry no, perhaps others can. – Thorbjørn Ravn Andersen Nov 30 '10 at 19:59
2

Transliterate all names into the same language (e.g. English) for searching, and use Levenshtein edit distance to compute the similarity between the phonetic representations of the names. This will be slow if you simply compare your query with every name, but if you pre-index all of the place names in your database into a Burkhard-Keller tree (BK-tree), then they can be searched efficiently by edit distance from the query term.

This technique allows you to sort names by how closely they actually match. You're probably more likely to find a match this way than with Metaphone or Double Metaphone, though it is more difficult to implement.
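
A minimal Java sketch of the idea, assuming the names have already been transliterated into Latin letters. `BkTree` is a hypothetical class name; the edit distance is the classic dynamic-programming Levenshtein distance, and the tree stores each child under its distance from the parent so that the triangle inequality can prune the search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical BK-tree over strings, keyed by Levenshtein distance.
final class BkTree {
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    // Levenshtein distance computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Inserts a word by walking down the child whose key equals the distance to each node.
    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) {
                return; // already present
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child;
        }
    }

    // Returns all stored words within maxDistance edits of the query.
    List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root != null) {
            search(root, query, maxDistance, matches);
        }
        return matches;
    }

    private void search(Node node, String query, int maxDistance, List<String> out) {
        int d = levenshtein(query, node.word);
        if (d <= maxDistance) {
            out.add(node.word);
        }
        // Triangle inequality: only children keyed in [d - max, d + max] can contain matches.
        for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
            Node child = node.children.get(k);
            if (child != null) {
                search(child, query, maxDistance, out);
            }
        }
    }
}
```

In use, you would add every transliterated place name once, then call `search` with the transliterated query and a small tolerance (the right tolerance is something to tune on real data). Sorting the returned names by their distance to the query gives the ranking described above.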

Ken Bloom
  • Given that writing the vowels in Arabic is optional, you may also want to throw out the vowels from any English names when doing cross-language search. – Ken Bloom Nov 26 '10 at 19:31
  • I want to do something like what you describe, but I am not finding any clues. I am still searching for a way to solve this. – IamIronMAN Nov 26 '10 at 19:33
1

Your Google suggestion sounds like it might also be a good one, but you should play around with it, and be sure that you're happy with its accuracy. In testing how it worked going between Hebrew and English, I noticed that sometimes Google just leaves English place names in English letters when translating to Hebrew.

Ken Bloom
0

How about using some localization on the client side to display values? Or create a wrapper class for hometown that overrides equals(Object) so that the instance for California returns true for both "California" and "كاليفورنيا" (sorry if I made a mistake here, I just copy-pasted from above).
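
One way such a wrapper could look in Java is sketched below. `Hometown` and its members are hypothetical names, and the set of known spellings still has to come from somewhere. Note one deliberate deviation from the suggestion: making `equals(Object)` return true for plain strings would break the symmetry that the `equals` contract requires, so this sketch compares two `Hometown` objects by a canonical id and offers a separate `matches(String)` method for raw names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper: one place, many spellings.
final class Hometown {
    private final String canonicalId;
    private final Set<String> knownNames;

    Hometown(String canonicalId, String... knownNames) {
        this.canonicalId = canonicalId;
        this.knownNames = new HashSet<>(Arrays.asList(knownNames));
    }

    // True when the raw name is one of the spellings stored for this place.
    boolean matches(String name) {
        return knownNames.contains(name);
    }

    // Two Hometown objects are equal when they denote the same place.
    @Override
    public boolean equals(Object other) {
        return other instanceof Hometown
                && canonicalId.equals(((Hometown) other).canonicalId);
    }

    @Override
    public int hashCode() {
        return canonicalId.hashCode();
    }
}
```

A `Hometown` created with the id "california" and both the English and the Arabic spelling then answers `matches` with true for either spelling, and compares equal to any other `Hometown` that carries the same id.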

Przemek Kryger
  • @Kryger I am a newbie. How do I create a wrapper class? I learnt from the net that it is used to represent primitive values when an object is required. What does that mean? How do I use equals(Object), and on what basis will it compare and return true? Please guide me. – IamIronMAN Nov 26 '10 at 18:42
  • @Leo-vin From your response for rsp, I can see you actually don't really know how to map e.g. "كاليفورنيا" to "California", as user may type in these names (add typos as well). There are 2 options: either you create the map somehow (user can pick up hometown from some kind of list) or you need to hire someone to map these names for you. – Przemek Kryger Nov 26 '10 at 18:53
  • @Kryger If there are millions of users with different hometowns, do I then need to map all of those to Arabic? – IamIronMAN Nov 26 '10 at 19:03
0

This sounds like a classic encoding problem. Whenever you transfer non-ASCII characters you need to make sure you're encoding them correctly. UTF-8 can represent both Arabic and English text, so it is a reasonable choice here.

In your setup you will probably have the following points:

Browser <-> Servlet container <-> Database
                   |
                System.out

At every system interface where chars (16-bit) are converted to bytes (8-bit), you need to make sure the encoding is correct.
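
The following Java sketch illustrates what goes wrong at such a boundary. `EncodingRoundTrip` is a hypothetical helper; it only uses `String.getBytes(Charset)` and the `String(byte[], Charset)` constructor from the standard library. The Arabic word from the question is written with Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that simulates one char-to-byte-to-char boundary.
final class EncodingRoundTrip {

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String arabic = "\u0643\u0627\u0644\u064A\u0641\u0648\u0631\u0646\u064A\u0627";

        // Same charset on both sides: the text survives unchanged.
        String ok = roundTrip(arabic, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(arabic));

        // Mismatched charsets: every UTF-8 byte is misread as a separate Latin-1 character.
        String broken = roundTrip(arabic, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals(arabic));
    }
}
```

If both sides of a boundary agree on UTF-8, the ten Arabic letters come back intact. If the receiving side assumes ISO-8859-1 instead, the twenty UTF-8 bytes turn into twenty unrelated characters, and a comparison against the stored hometown can no longer succeed.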

Browser to Servlet container

When you make GET or POST requests from a web page, the browser decides which encoding to use from the HTTP headers sent by the server, especially Content-Type: text/html; charset=UTF-8, which, if present, overrides the HTML meta header <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">.

On the servlet container side, HttpServletRequest.getParameter() decodes parameters using an encoding that you most likely need to set in the server settings.

Example Tomcat server.xml

<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
           maxThreads="2000"                
           connectionTimeout="20000" 
           redirectPort="8443" />

Servlet container to Database

The database needs to have the correct encodings, or sorting etc will not be right.

Example my.cnf for MySQL

[mysqld] 
 ....
init_connect='SET collation_connection = utf8_general_ci' 
init_connect='SET NAMES utf8' 
default-character-set=utf8 
character-set-server = utf8 
collation-server = utf8_general_ci 

[mysql] 
 ....
default-character-set=utf8 

Then the JDBC-driver needs to be set for UTF-8.

Example JDBC connect string

jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8

System.out

System.out.println() cannot be relied upon to verify things. First, it depends on the Java VM default encoding, set using the system property -Dfile.encoding=UTF-8. Second, the terminal in which you view System.out needs to be set to, and support, UTF-8. Don't trust System.out!

Once a String in the VM holds the proper characters, it is not affected by encoding. In memory every char in a String is 16 bits, which covers most of the characters that UTF-8 can encode (the rest are stored as surrogate pairs of two chars). You can write the string to a file and inspect the file to really know whether you got the correct chars in your VM.
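
As an alternative check that does not depend on the terminal, you could print the Unicode code points of the string instead of the characters themselves. The sketch below is one way to do that; `CodePointDump` is a hypothetical helper, and its output is plain ASCII, so it looks the same under any console encoding.

```java
// Hypothetical debugging helper: renders a string as a list of code points.
final class CodePointDump {

    // Formats each code point as U+XXXX, separated by single spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }
}
```

If a hometown read from the request or the database is really Arabic, its code points fall in the Arabic block (U+0600 to U+06FF). A run of values such as U+00D9 and U+0083 instead suggests the bytes were decoded with the wrong charset somewhere upstream.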

Martin Algesten
  • I have already set the encoding for my application. I converted everything to UTF-8, including MySQL. – IamIronMAN Nov 26 '10 at 18:58
  • This is not an encoding problem - كاليفورني is Arabic for California, not an encoding error. Your advice is good, but I fear entirely irrelevant to the question. – Tom Anderson Nov 26 '10 at 19:02