-2

I have streaming strings (text containing words and number).

Taking one line at a time for streaming strings, I would like to assign a unique value to them.

the examples may be:strings with their scores/hash

  User1 logged in Comp1 port8087       1109      
  User2 logged in comp2                1135
  user3 logged in port8080             1098
  user1 logged in comp2 port8080       1178       

these string should be in same cluster. For this what i have thought is mapping(bad type of hashing) the strings such that the small change in the string wont affect the score that much.

One simple way of doing that may be: taking UliCp8, Ulic .... ( i.e. 1st letter of each sentence) and find some way of scoring. After then the similar scored strings are kept in same bucket and later on sub group them.

The improved method would be: lets not take out first word of each word of the string but find some way to take representative value of the word such that the string representation may be quite suitable for mapping with score/hash as i mention.

Considering Levenstein distance or jaccard_index or some similarity distance metrices, all of them require inputting the strings for comparisions. Isn't there any method to hash/score the string as stated without going for comparisions.( POS tagging, comparing looks uneffiecient for my purpose as the data are streaming, huge in number, unstructured)

Hope you understand what i want to achieve and please help me out. Forgot about the comments below and lets restart.

  • 1
    Why are you doing this? You should always explain your end goal, not just what step you are having problems with. – Dour High Arch Jul 07 '11 at 18:12
  • 1
    It's amazing how useful the answers are to such a vague, homework-ish question – Jared Updike Jul 07 '11 at 23:12
  • well.. the end goal is to cluster the string messages, recognizing their patterns.. for example : "my name is jared" & "my name is dour" are strings. there are lots of strings as such. To map the string with proper (set of)integers such that the hash value of the sentence should be near to each other. we can develop (or use any existing idea if there's any) some algorithm to hash the strings so that similar strings are kept into same bucket. Then after that there is way to find out the patterns for the bucket. in our case "my name is *" is the required pattern that the project should output. – ajan lal shrestha Jul 08 '11 at 10:52
  • i guess i made clear to you. the main aim of the project is to develop "event log classifier". it may be that i wasn't able to explain my problem but i think its one of the most needed area of research specially when considered about network security. – ajan lal shrestha Jul 08 '11 at 10:57

2 Answers2

2

"at least two similar word (not considering length) should have similar hash value"

This is against the most basic requirements for a hash function. With a hash function also minimal changes to the input should produce vehement changes to the bucket the hash falls into.

You are looking for an algorithm that calculates the similarity or distance between two inputs.

Hyperboreus
  • 31,997
  • 9
  • 47
  • 87
  • emboss already pointed to the Levenshtein distance, which might be what you are looking for. – Hyperboreus Jul 07 '11 at 18:41
  • It would be really easier to help you, if you commented on what you are trying to achieve with it? Find lexical or semantic or semiotic similar strings? Is it human language you process? Which one? Can you give some examples of sets of strings that should be "similar" and others that should be "dissimilar"? – Hyperboreus Jul 07 '11 at 18:45
  • ok is there any way of doing so...i dont mind that being against basic requirement of hash value. i need some ideas please help me out...or is there any ways to map or has entire string with any integer(or set of integer; in this case the integer value(s) should be dependent on length of word and positional behavoiur of word). SO the conclusion is that: i need some ideas (or algorithm to map word by a number such that it doesn't depend on number of letters on word and/or : i need to map entire string (one line text) such that it should depend on number of words. did you get me? – ajan lal shrestha Jul 07 '11 at 18:53
  • i dont think testing the semantic or lexical patterns would do that. certainly this might be one way but what i have thought is for streaming algorithm. I wast to Classify the event logs considering only the message part of logs. example i want to find patterns(or regular expressions) like " Authentication for * is accepted" (this is just an example). did i made clear to? i can describe them further but my overall aim is not just random or without any base. i've tried to describe them in other comment as well. please go through them once. i really need help. – ajan lal shrestha Jul 08 '11 at 11:07
  • OK I will Learn my self.... please provide some way( at least some name or term or tags) to hash string/sentence only considering itself(it means not comparing with other strings) . AT LEAST PROVIDE SOME USEFUL TERMS/tags/keywords/technique name SO THAT I MAY EXPLORE THEM. this is much important. i might lose my final project and engineering degree... – ajan lal shrestha Jul 08 '11 at 16:17
  • Your intention to learn yourself is very noble, especially when it is your degree and not SO's degree. CAPS LOCK IS CRUISE CONTROL FOR COOL BUT EVEN ON CRUISE CONTROL YOU HAVE TO STEER. Search terms for hashing are "hash function", "md5", etc. But I doubt that this is what you need. Try google for "clustering", "Levenstein", "tesselation", etc. – Hyperboreus Jul 08 '11 at 16:29
  • If you want "Error 1234 has occured on connection abcd" and "Error 5465 has occured on connection fsefsef" to fall into the same cluster but not "Connection reset by client abcd". Use the levenshtein distance on a word base as already mentioned thrice. Difference between the first to is 2, and between first and third is 8. Have you even read the link about Levenshtein? – Hyperboreus Jul 08 '11 at 16:36
  • sir, i have gone through Levenshtein. but please tell me. there would be thousands of sentences (in system i am using this the data inflow is approx 20000logs/sec. is it possible to compare each and every of them? i already mentioned streaming logs. "i dont need the answer to my problem" What i need is just some methods/algorithms of sentence hashing that give priority to sentence length and position of words. these hashing should be independent of other sentences that follow it.-hashing sentences on its own. i am going to improve the algorithm and apply acc to nature of my inputs. thats all. – ajan lal shrestha Jul 09 '11 at 02:43
  • well Hyperboreus , tell me one thing... how can i compare about million sentences(logs or whatever) using Levenshtein distance for similarity checking. i understood ur example.."Error 1234 has ...." but its not about some hundred logs or so...i am talking about somw millions and comparing them using Levenshtein is NOT FEASILBLE. – ajan lal shrestha Jul 09 '11 at 09:38
  • And Hyperboreus, the various hashing algorithms including MD5 etc etc that i found implement similar idea "there is drastic change in the value for small change in string" i just want opposite. please dont be angry with me. if u give me ur email id i can explain in more clearer way in more detail - my problem and what i've thought about solving that. but i need some help guys. i really need some way to start. – ajan lal shrestha Jul 09 '11 at 09:44
  • Look dude, your problem is that you keep on babbling about hashing without knowing what it is and what you want. Just erase this word from your dictionary. Translation your question into plain english, it reads: "I am looking for a cow, but it must fly and must not give milk." – Hyperboreus Jul 09 '11 at 15:12
  • ok i may not have enough background to discuss. i maynot have choosen proper key word while describing problem. arguing here wont solve my problem. thanks for ur help. i will better myself and again ask u question. i hope u dont mind answering if i improve :) – ajan lal shrestha Jul 09 '11 at 17:37
  • Sir i have improved the question. hope that clearly specifies my intensions and hope you would help me throughout – ajan lal shrestha Jul 13 '11 at 02:42
2

As stated you are not looking for a hash function, rather something like the Levenshtein distance which is an algorithm for calculating a metric representing the degree of differences between two sequences of bits. It is commonly used to find out how similar/dissimilar two strings are. Hashing / message digests are good for creating identifiers for unique, distinct values but they will produce entirely different results for "similar" values.

You are interested in the similarity of strings. Here is a nice post that names a few resources that are used for measuring string similarity. Maybe Lucene could help you in your situation.

Community
  • 1
  • 1
emboss
  • 38,880
  • 7
  • 101
  • 108
  • yes i exactly want that..so then please tell me other method than the Levenshtein distance(i have read that and cant be used in ours). as far as i have known after all these comment, i want algorithms/methods/idea exactly complementary of hashing techniques – ajan lal shrestha Jul 09 '11 at 04:06
  • @ajan: See my edit. Those should be good starting points, so at least you should now know what to look for. – emboss Jul 09 '11 at 13:24
  • You're welcome. Just as a reminder, if you feel your request is solved, please vote for good answers and accept the answer you were most pleased with. Many new users seem to forget... :) – emboss Jul 09 '11 at 17:55