1

I have a question about word classification in Spark. I am working on a simple classification model that takes a word (a single word) as input and predicts the race of the named person (it is from a fictitious universe). For example, Gimli -> dwarf, Legolas -> elf.

My issue is how to process the words. I know that Spark includes two feature vectorization methods, TF-IDF and Word2Vec. However, I am having difficulty understanding them and do not know which one to use.

Could anyone explain them to me and guide me through the process? More importantly, I would like to know which of these methods is the most appropriate for this case.

Thanks

user3276768

2 Answers

2

First, we should be clear that the correct approach will depend on the data.*

This task is called language detection or language identification. Even for entire sentences or pages, building vectors from entire words is not the right approach. (It would only work on names you have actually encountered in training, like a lookup list; no real prediction.) Rather, you need an n-gram model based on characters. For example, in a bigram model:
"Gimli" --> "_G Gi im ml li i_"

Unfortunately you cannot use pyspark.ml.feature.NGram for this out of the box, because it assumes a gram is a word, not a character.

What to do?

You must first find or write a function that performs this transformation to character n-grams, and apply it both to the original names and to the queries that come into your system. (If names contain spaces, treat the space as a character too.)
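Spark does not ship such a helper, so here is a minimal sketch in plain Python (the function name and the '_' boundary padding are my own illustrative choices, not from any library):

```python
# A minimal character n-gram tokenizer; '_' marks word boundaries.
def char_ngrams(name, n=2, pad="_"):
    """Turn a single word into a list of character n-grams with boundary padding."""
    padded = pad + name + pad  # e.g. "Gimli" -> "_Gimli_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("Gimli"))  # ['_G', 'Gi', 'im', 'ml', 'li', 'i_']
```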

Then, in Spark terminology, these character n-grams are your "words", and the string containing all of them (e.g. "_G Gi im ml li i_") is your "document".

(And, if you like, you can now use NGram: splitting the name into single characters `['G', 'i', 'm', 'l', 'i']` and then using NGram with n=2 should be more or less equivalent to splitting it directly into `['_G', 'Gi', 'im', ...]`.)
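A hedged sketch of that NGram route, assuming a recent PySpark; the toy data and the column names ("name", "race", "chars", "ngrams") are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import NGram

spark = SparkSession.builder.getOrCreate()

# Toy data; in practice this would be your labelled name/race table.
df = spark.createDataFrame([("Gimli", "dwarf"), ("Legolas", "elf")],
                           ["name", "race"])

# Split "Gimli" into ['_', 'G', 'i', 'm', 'l', 'i', '_']; the '_' padding marks
# word boundaries, which NGram itself knows nothing about.
to_chars = udf(lambda s: list("_" + s + "_"), ArrayType(StringType()))
df = df.withColumn("chars", to_chars("name"))

ngram = NGram(n=2, inputCol="chars", outputCol="ngrams")
df_ngrams = ngram.transform(df)
df_ngrams.select("name", "ngrams").show(truncate=False)
# "Gimli" -> [_ G, G i, i m, m l, l i, i _]
```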

Once you frame it that way, it becomes a flavour of the standard document classification problem (actually done with "regression" models in strict Spark terminology), for which Spark has a few options. The main thing to note is that order is important: do not use approaches that treat the input like a bag of words. So, although all of the Spark classification examples to be found vectorise with TF-IDF (and it will not completely fail in your case), it will be suboptimal, because I assume that the order/context of each character n-gram is actually important.
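For concreteness, a minimal sketch of that standard pipeline, reusing the `df_ngrams` DataFrame from the snippet above. As noted, TF-IDF over the grams is only a baseline, since it discards ordering beyond the n-gram window; all parameters here are illustrative:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="race", outputCol="label"),   # "dwarf"/"elf" -> 0.0/1.0
    CountVectorizer(inputCol="ngrams", outputCol="tf"),  # bag of character n-grams
    IDF(inputCol="tf", outputCol="features"),            # down-weight frequent grams
    LogisticRegression(maxIter=50),
])

model = pipeline.fit(df_ngrams)           # train on the labelled names
predictions = model.transform(df_ngrams)  # adds a "prediction" column
```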

As far as optimising it for accuracy, there are possible refinements around alphabets, special characters, case sensitivity, stemming, etc. It depends on your data; see below. (It would be interesting if you posted a link to the entire dataset.)

* Regarding the data and assumptions about it: the character n-gram approach works well for identifying actual human languages from planet Earth. Even for human languages, there are special cases for classes of text like names: for example, Chinese characters could be used, or languages like Haitian or Tagalog where many of the names are just French or Spanish, or Persian or Urdu where they are just Arabic, pronounced differently but spelled the same.

We know the basic problems and techniques for words from major human languages, but, for all we know, the names in your data:

- are in random or mixed alphabets
- contain special characters like '/' or '_', normally more likely seen in URLs
- are numbers

Likewise interesting is the question of how the names correlate to group membership. For example, it could be that the names are randomly generated from alphabetic characters, or simply taken from a list of English names, or generated using any other approach, and then randomly assigned to class A or B. In this case it is not possible to predict whether names yet unseen are members of A or B. It is also possible that As are named for the day of the week on which they were born, but Bs for the day of the week on which they were conceived. In this case prediction is possible, but not without more information.

In another scenario, again the same generator is used, but names are assigned to A or B based on:

- length, i.e. number (< or >= some cutoff) of chars/bytes/vowels/uppercase letters
- length, i.e. number (even or odd) of ...

In these cases a completely different set of features must be extracted.

In yet another scenario, the names of B are always repeated blocks like 'johnjohn'. In this case character n-gram frequencies can work better than random guessing, but it is not the optimal approach.

So you will always need some intuition about the problem. It's difficult for us to make assumptions about an artificial world; from the two examples you have given, we might assume the names are somewhat English-ish. And finally, you must try different approaches and features (and ideally whatever classifier you choose will simply ignore useless signals). At least in the real world, features like word count, char count and byte count are actually useful for this problem; they can augment the character n-gram approach, as sketched below.
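A hedged sketch of adding such count features in PySpark, reusing the toy `df` from the earlier snippet (column names are illustrative; they would be combined with the n-gram vector via VectorAssembler before classification):

```python
from pyspark.sql.functions import length, udf
from pyspark.sql.types import IntegerType

# Simple per-name counts that can augment the character n-gram features.
byte_count = udf(lambda s: len(s.encode("utf-8")), IntegerType())
vowel_count = udf(lambda s: sum(c in "aeiouAEIOU" for c in s), IntegerType())

df_feats = (df
            .withColumn("char_count", length("name"))
            .withColumn("byte_count", byte_count("name"))
            .withColumn("vowel_count", vowel_count("name")))
```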

Adam Bittlingmayer
  • 3
    I'm afraid this is not an answer, rather a comment! – eliasah Sep 24 '15 at 07:51
  • The asker was really on the wrong track. I will modify my answer to make it a bit more concrete with regard to Spark. – Adam Bittlingmayer Sep 25 '15 at 08:48
Hi Adam. Thanks for the reply. Do you mind if I ask you something else? Once I have my array `['_G', 'Gi', 'im', ...]`, should I feed it that way to the classification model, or should I transform it first using word2vec or TF-IDF? Thanks again. – user3276768 Sep 25 '15 at 13:53
  • 2
The examples I see all vectorise with TF-IDF. It will not fail in your case, but I think it will be suboptimal, because I assume that the order/context of each character n-gram is actually important. I would actually transform the vector with https://spark.apache.org/docs/latest/ml-features.html#n-gram (and if you do it like that, you can actually use *unigrams*, i.e. n=1: `['_G', 'i'...]`). But without knowing your data I cannot say; you should try a few ways. – Adam Bittlingmayer Sep 27 '15 at 17:02
  • Updated my answer a bit – Adam Bittlingmayer Sep 27 '15 at 17:10
Hey Adam, thanks for your responses. The dataset is just names, nothing special; just characters from the alphabet. One last question (I will mark your response as the answer regardless): in your original answer you said that I can't use Spark's NGram functions because they assume a gram is a word (and I agree with you); however, in one of your comments, you said that you would use NGram to transform the vector. My question is: why did you recommend not using NGram, if you would use it yourself (maybe I misunderstood you)? Once again, thanks a lot for your help! – user3276768 Sep 30 '15 at 00:20
  • 2
@user3276768 Pardon for being unclear. I mean, you cannot use NGram on the original document "Gimli", because it is only a single word. After you split it up, you can use it. (Using NGram with n=2 after splitting into `['G', 'i', 'm', 'l', 'i']` and splitting directly into `['_G', 'Gi', 'im'...]` are more or less equivalent.) – Adam Bittlingmayer Sep 30 '15 at 06:31
  • 2
About the data, I just want to emphasise that without seeing it I cannot make any guarantee about this approach. For example, it could be that the names are randomly generated from alphabetic chars, or simply a list of English names, and then randomly assigned to race A or B. In this case it is not possible to predict the race of new names. In another scenario, again the same generator is used, but the names of B are simply longer than the names of A. Or the names of B are always something repeated, like 'johnjohn'. Perhaps race is assigned by whether the sum of char bytes is even or odd. – Adam Bittlingmayer Sep 30 '15 at 06:40
  • 2
    So you must have intuition about which features are interesting for your data. For identifying actual human languages, the character n-gram approach works well. – Adam Bittlingmayer Sep 30 '15 at 06:41
  • 1
    My pleasure. I updated the answer a bit to reflect our comments. – Adam Bittlingmayer Oct 02 '15 at 07:17
1

No model can predict the race or species from just a name.
You can create a lookup dictionary of all possible characters and their races using Wikipedia or DBpedia, and then pass each name to a function that returns the race.
If the data is huge and you want to do it in less time, you can use a join instead, as sketched below.
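A minimal sketch of that lookup-and-join approach, assuming you have already scraped a (name, race) table from Wikipedia/DBpedia; all DataFrame and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup table built from Wikipedia/DBpedia.
lookup = spark.createDataFrame([("Gimli", "dwarf"), ("Legolas", "elf")],
                               ["name", "race"])
queries = spark.createDataFrame([("Gimli",), ("Frodo",)], ["name"])

# broadcast() hints a map-side join, which is cheap when the lookup is small.
resolved = queries.join(broadcast(lookup), on="name", how="left")
resolved.show()  # names missing from the lookup come back with race = null
```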

Ajay Gupta
  • This is also not an answer, rather a comment! – eliasah Sep 27 '15 at 18:15
@eliasah: There are some questions which do not have an answer, and this is one of them. And that's the technique we apply to map the data, as no model can guess the race from a name unless it is trained on some parameters (variables). – Ajay Gupta Sep 27 '15 at 18:18
As much as I can agree with what you have said, this remains a comment, and if you judge that the question is too broad, you can flag it as such and moderators will look into it. – eliasah Sep 27 '15 at 18:22
  • The issue here is not the model or the prediction; this is something I am doing for fun. I'm just wondering about the best way to tackle the problem. – user3276768 Sep 28 '15 at 14:36