2

I have two different files each of whose content is coming from different streams of data. I have some data collected from these streams in two different files. Then i want to search the files to find any sort of patterns, So that at a later stage if i collect some more data from the streams i should be able to distinguish which data belongs to which stream (based on the patterns that i have found earlier).

An example of the data contained in the file can be : b0 82 91 a2 c3 89 b0 82 4a e3....(more bytes)... Though i have taken very few bytes here, but we can find the pattern "b0 82" coming twice above. So the output should show the pattern and the no of times it is coming. Similarly we can have 3 byte pattern or even more byte pattern.

Still other example can be : aa 00 a7 2f 7b 4c ....(more bytes).....aa 01 a7.........(more bytes)......aa 05 a7..... I think even this can be considered a pattern of 3 bytes where two bytes (aa & a7) are fixed and middle one varies from 00 to 05.

These are two examples that i could think of though there can be more patterns possibly. Even there may be some hidden patterns which can't be visualized immediately. The whole idea is any pattern will do as long as that helps to distinguish between two streams at a later stage. I think i am more clear now on specifying my problem. Please let me know the following things :

  1. How can we do this type of pattern finding?

  2. Are any tools or libraries which can help for this purpose?

  3. Also which language or tool to use for efficient and faster development?

  4. can the field of data mining help for this purpose ? If yes how to go ahead with that?

mezda
  • 3,537
  • 6
  • 30
  • 37
  • Can you be more specific about what you mean by "pattern?" – templatetypedef Feb 15 '12 at 08:42
  • pattern can be anything which can be distinguished from the rest of data. For example, it can be any byte say 0x4a or 0x56 or any byte. or even combination of them like 0x4a56. Further, if say there are some bytes which are having their 5 most significant bits same, while lower 3 bits vary from 000 to 111, then this also forms a pattern since 5bits are same at several places. This is what i can think of possible patterns. Still may be you can think of more such patterns, only thing that i want is they should be easily distinguishable. – mezda Feb 15 '12 at 09:00
  • 1
    This is impossible to do in general, since pretty much anything can be a pattern. What do you want to use this for? Perhaps there's a more specific problem? – templatetypedef Feb 15 '12 at 09:44
  • 2
    **Why hex**? Do you search for patterns that are aligned on 4-bit borders? – Has QUIT--Anony-Mousse Feb 15 '12 at 11:43
  • @templatetypedef it's called machine learning. Have you heard of it? – Neil McGuigan Feb 16 '12 at 04:23
  • @templatetypedef...anony-mousse...el chief: i have explained my prob in some detail above. please let me know if i am clear now and also possibly the answers of my questions above. thanks – mezda Feb 16 '12 at 06:32
  • @user1182722 You should do text mining on this. It is essentially the same. You will need to split the document/stream into "words" – Neil McGuigan Feb 16 '12 at 18:41
  • @elchief : thanks for the reply...as i am new to this area, can you please let me know any books, or links and the required tools for text mining...and which language will be best suitable to go ahead for this type of problem...will C be good ? Are there any libraries for text mining avialable ? please let me know. thanks – mezda Feb 17 '12 at 12:49
  • The best text mining tools are rapidminer and gate. Gate is more powerful, but rapidminer much easier to use. – Neil McGuigan Feb 17 '12 at 21:50

4 Answers4

1

This seems like a pretty typical ngram finding problem. Here is a link to some ngram solutions.

quicker way to detect n-grams in a string?

You should treat your hex just like any other string.

Community
  • 1
  • 1
hockey_dave
  • 524
  • 1
  • 7
  • 18
0

You can train Markov Models or even hidden Markov Models on your streams and use these models to decide which stream new data most probably belongs to. There are supposedly tens of libraries which can do this in your programming language of choice.

Maybe start by reading a book. I suggest Pattern Recognition by C. Bishop.

ziggystar
  • 28,410
  • 9
  • 72
  • 124
  • can you let me know the available libraries. i am using c over linux. or u can tell me libraries available for other platforms. thanks. – mezda Mar 13 '12 at 12:03
-1

Here is another idea. Whether it works for you depends on how much data you are processing, how much memory you can use, and whether the kinds of patterns it detects end up being useful for your purposes.

With all these qualifications in mind, you may want to try using a suffix tree or suffix array. Especially for suffix trees, there are algorithms that enable you to constantly update the tree as you append characters to the text (so called online suffix tree construction), most notably the Ukkonen algorithm. This might work particularly well with the use of data streams (as opposed to a fixed-length, fully defined input text).

A suffix tree (and, in a similar way, a suffix array) represents all suffixes (in the sense of string ends, not linguistic suffixes) of a text. As such, it is particularly suitable for (a) checking whether any given string is a substring of the text, and (b) for detecting repeated substrings in the text. With a few modifications in the right places, it can be used to detect repeated substrings with slight alterations (like your example of a pattern that is repeated with one character exchanged in the middle).

For a thorough introduction to these data structures, and if you have access to a university library or have the money, Dan Gusfield's introduction to algorithms on strings, trees and sequences will be very helpful. But there are a number of questions and answers related to this on SO, too.

If, after some further reading, you think it's worth a try, I could elaborate further on how I think suffix trees could be used for your purposes, in an answer to a new question specifically about the use of these algorithms for repeated pattern detection.

jogojapan
  • 68,383
  • 11
  • 101
  • 131
  • Thanks @jogojapan for the answer. i will get back to you on this. before that i need to do some reading on this. – mezda Mar 13 '12 at 12:02
-2

Your question isn't completely defined, but I will try to give you a few pointers:

  1. Your patterns are probably expressible as regular expressions. If you don't know what these are - I would try to look for a concrete example in your favorite programming language. Python is a good option (the re module is included in the core language). For C++, use boost::regex, and for other languages, use google :)

  2. Now - for using regexes to search binaries (hex) instead of text, try looking at something like this.

Good luck :)

Guy Adini
  • 5,188
  • 5
  • 32
  • 34
  • i have explained my prob in some detail above. please let me know if i am clear now and also possibly the answers of my questions above. thanks – mezda Feb 16 '12 at 06:33