1

Im looking for a way to find all repeating sequences consisting of at least 3 characters on an input file and then print out the most frequent ones! it seems to require a lot of string processing and intense searching through the input file specially cause there is no upper bound on the max size of the patterns to look for!

Is there any efficient algorithm to do it with the least possible processing and messiness? Should I use string.h or I'd be better off with char arrays? Any tips/helpful snippets etc. on how to start?

tnx

Saba Ahang
  • 578
  • 9
  • 24
  • Do you need to find all of them or the most frequent? – Luchian Grigore Jun 19 '12 at 11:13
  • can you give expected file and output –  Jun 19 '12 at 11:15
  • If the file is small, you can load it all into memory so you don't have top read it several times. You could also research things like [suffix trees](http://en.wikipedia.org/wiki/Suffix_tree)? – Some programmer dude Jun 19 '12 at 11:16
  • this could indeed be handy to find violations of the DRY principle in a codebase :] – moooeeeep Jun 19 '12 at 11:21
  • Do you have a preference for longer sequences? Because any common sequence of 5 characters (abcde) will always require at least that number of 4 character sequences (abcd, bcde) and 3 character sequences (abc, bcd, cde), so the longer sequences would always have lower frequency than the longer sequences. – Phil H Jun 19 '12 at 11:21
  • Just to clarify, are you perhaps looking for streaks of more than 3 equal characters in a row? – vgru Jun 19 '12 at 11:23
  • @LuchianGrigore: eventually I need to find the most frequent for I supposed that it would be wise to save any repetitions for the sake of deployments if lower frequency patterns was in question. but What difference does it make? – Saba Ahang Jun 19 '12 at 15:10
  • @gcc: say the input includes the following string: "theras junder thera the juju junderasther" the output should be sth similar to: the=4, ther=3, era=3, eras=2, ras=2, jun=2, der=2, junder=2, her=3, hera=2, und=2, unde=2, under=2, nde=2, nder=2, I might have missed a pattern!!! but according to this we may call "the, ther, era, her" the most frequent patterns – Saba Ahang Jun 19 '12 at 15:28
  • @JoachimPileborg how come I dont need to read it several times if it be on memory? in any case I for every pattern I should go through the whole string. no? how to load into memory? – Saba Ahang Jun 19 '12 at 15:30
  • @PhilH thats a good question actually but I think I need to count all the different more-than-3-letter patterns regardless of whether they are inclusive of some other patterns or part of a bigger pattern – Saba Ahang Jun 19 '12 at 15:35
  • @Groo no no nothing equal!! just similar... look at the sample I provided n comments – Saba Ahang Jun 19 '12 at 15:36
  • Given your comment about the=4 etc, I suggest you update the question to read *at least* 3 characters, not *more than*. Then @LuchianGrigore's answer would be that the most common is 3 characters long. – Phil H Jun 19 '12 at 16:12

2 Answers2

5

I would suggest that you do create a suffix tree from the file. This will have linear complexity with respect to the size of your file and will solve the problem. You can modify the algorithm just a little bit to store how many times is a string met apart from the string itself. Here is a great post explaining how to create a suffix tree.

Community
  • 1
  • 1
Ivaylo Strandjev
  • 69,226
  • 18
  • 123
  • 176
  • This method also allows you to find longer sequences given some heuristic for preferring them; by looking for the children of a node in the tree, you can see if the majority of instances of abc are in strings of abcd, and these in abcdefg, etc. – Phil H Jun 19 '12 at 11:23
1

Finding the most frequent one is quite easy, if you realize that the most frequent sequence is 4 characters long. It can be done in O(n) time, where n is the size of the input file.

You can build a std::map<string,int>, iterate character by character taking sequences of 4 characters at a time, and increment the value with the respective key in the map.

Luchian Grigore
  • 253,575
  • 64
  • 457
  • 625
  • The problem stated that the length of the strings is unbounded. If you can't determine the length of the strings then this method wouldn't work. Also, you would need to add 4 characters to the map for each character in the file. – jlunavtgrad Jun 19 '12 at 12:55
  • 1
    @jlunavtgrad if you're only looking for the most frequent one, it doesn't matter. The most frequent substring is 4 characters long. – Luchian Grigore Jun 19 '12 at 12:57