My training data consists of strings like:

EMI3776438, U9BA7E, 20FXU84P, 4506067765, N8UZ00351

I am using the K-Neighbors classifier algorithm.

Right now, my approach is to convert each letter to a number.

For example, a/A would map to 10, b/B would map to 11, c/C would map to 12. After the conversion, I will send this data to the K-Neighbors classifier.

So, for example, ABI37 becomes 10111837.

The problem with this method is that both AA and 1010 map to 1010, so the algorithm has no way to tell them apart and classify them properly.
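The collision is easy to reproduce. Here is a minimal sketch of the mapping described above; the names `LETTER_CODES` and `naive_encode` are made up for illustration and are not from any library.

```python
import string

# Mapping from the question: A -> 10, B -> 11, ..., Z -> 35. Digits are left unchanged.
LETTER_CODES = {ch: str(10 + i) for i, ch in enumerate(string.ascii_uppercase)}

def naive_encode(text: str) -> str:
    """Replace every letter with its two-digit code and keep digits as they are."""
    return "".join(LETTER_CODES.get(ch, ch) for ch in text.upper())

print(naive_encode("ABI37"))  # 10111837
print(naive_encode("AA"))     # 1010
print(naive_encode("1010"))   # 1010, identical to the encoding of "AA"
```

Because digits keep one character while letters expand to two, the encoded string cannot be decoded unambiguously.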

Is there a good way to convert these strings to numbers only (since this algorithm only works on numbers) so that distinct strings stay distinct and the classification is done correctly?

Thomas Smyth - Treliant
  • https://stackoverflow.com/questions/53420705/python-reversibly-encode-alphanumeric-string-to-integer – Ashu Grover Feb 20 '19 at 10:41
  • Yes, you can simply convert characters to int, for example, but this misses the key point: it does not necessarily give a meaningful measure of string-to-string 'distance', which k-NN needs in order to give something sensible. – peter554 Feb 20 '19 at 10:45
  • It is a common task in NLP; try using Word2Vec. – Maciej M Feb 20 '19 at 10:49

1 Answer

To do this, you first need to decide on a distance (or 'metric') for comparing strings. Once you have a metric, applying k-NN to the data is easy, because k-NN only needs to be able to ask 'what is the distance between these two data points?'. See this Wikipedia article for ideas.
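As a rough sketch of what that could look like, here is a tiny k-NN that works directly on strings, using Levenshtein (edit) distance as one possible metric. The choice of edit distance, the helper names `levenshtein` and `knn_predict`, and the labelled samples are all assumptions made for illustration; the question does not say what the classes are.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def knn_predict(query, samples, labels, k=3):
    """Majority vote among the k samples closest to the query under edit distance."""
    ranked = sorted(zip(samples, labels), key=lambda pair: levenshtein(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: the strings and labels are invented for this example.
samples = ["EMI3776438", "EMI3776439", "EMI1234567",
           "4506067765", "4506067766", "4500001234"]
labels = ["invoice", "invoice", "invoice", "order", "order", "order"]

print(knn_predict("EMI3776430", samples, labels))  # invoice
print(knn_predict("4506067760", samples, labels))  # order
```

No conversion to numbers is needed here, because the metric is defined on the strings themselves. Whether edit distance is the right notion of similarity depends on what the codes represent.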

You can simply convert characters to int as you suggest, but this misses the key point: it does not necessarily give a meaningful measure of string-to-string 'distance', which k-NN needs in order to give something sensible. The best metric depends on the details of the particular problem, i.e. on what your data actually represents!
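To make that concrete, here is a small sketch that assumes the strings only contain 0-9 and A-Z and reads each one as a base-36 number with Python's built-in `int(text, 36)`. That removes the AA/1010 collision (although leading zeros are still lost, so it is only safe for fixed-length strings), but the numeric gap between two strings says little about how similar they look. The helper `numeric_gap` is a made-up name.

```python
def numeric_gap(a: str, b: str) -> int:
    """Absolute difference between two strings read as base-36 integers."""
    return abs(int(a, 36) - int(b, 36))

# The collision from the question is gone: "AA" and "1010" are different numbers.
print(int("AA", 36), int("1010", 36))  # 370 46692

# One character changed in each pair, but the gaps differ by a factor of 36.
print(numeric_gap("A1", "A2"))  # 1
print(numeric_gap("A1", "B1"))  # 36

# Both characters changed, yet the numbers are neighbours.
print(numeric_gap("AZ", "B0"))  # 1
```

So a reversible encoding on its own does not fix the classification; k-NN would still be measuring a distance that depends mostly on the leftmost characters.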

This issue discusses a similar problem.

peter554