3

First time asking a question on here;

I am looking for a way to be able to use a search algorithm, or a built in method to dynamically search for repeating sequences within a string, or other variable.

The reason I say dynamic, is because I want it to be able to search through the string and locate repeating sequences on its own. I am not going to be able to supply a constructor of a sequence to look for.

I am unsure if this is even possible, but if it is, all help would be appreciated!

Here is a basic visual representation of what I am looking for (mind you, this is not code, just a for instance of a string)


This is going to be a long string that will have sequences throughout it. This may have matching characters side by side or it may not, but regardless, this is going to be a long string. If this is going to be a long string, I need it to find these sequences throughout it on its own!


As you can see by the above example, there are 2 sets of matching sequences throughout the single string. If there is any way to identify these programatically, along with being able to be searched through very fast for these different patterns, it would help me significantly!

The matches will most likely be stored in a List / array for later use as well.

Thank you for any help you are able to provide!


Edit: As this question was asked, case sensitivity will not be an issue.

When I was mentioning there were 2 matches, I meant that 2 particular sequences, had a duplicate. One of which, had 2 duplicates.

@HenkHolterman You are correct that this is going to be a compression algorithm, however, I was not sure where to start for looking for the sequences that I will be matching.

I had been doing multiple searches regarding something similar to this, but was coming up short with the answers I were looking for. That is why my question was posed here the way it was.

Thank you for all the responses I have gotten so far though!

Justin
  • 31
  • 4
  • This is a nice problem. Can you define what a **sequence** is? Is anything that contains two spaces a sequence? – Andre Calil Apr 23 '13 at 18:11
  • you can do brute force I'd assume that would be about **O(n^3)** – Sam I am says Reinstate Monica Apr 23 '13 at 18:11
  • 1
    This is at least related to compression algorithms. – H H Apr 23 '13 at 18:11
  • Hi, welcome to SO. [**Read the FAQ**](http://stackoverflow.com/faq). Looks like you have a pretty distinct idea of what you want to happen, that's great! What we need you to do is try this idea on your own. See what you can put together by doing some research independently. When you've come up with some code and hit a wall that you just can't figure out, come post what code you have, describe the problem, and we'll be happy to help. – tnw Apr 23 '13 at 18:11
  • Thanks for the quick responses! A sequence (in what I am looking for) is matching data that in a side by side comparison. These sequences can be in different places, but I want to capture the places and patterns that were matched and hold those in (most likely) a volatile variable. Thanks again! – Justin Apr 23 '13 at 18:20
  • While you are trying this on our own as @tnw has suggested, you may want to think of min lengths, if case should matter, if white space should be ignored, and other details. – Scott Adams Apr 23 '13 at 18:22
  • 1
    Note that you're most likely going to end up with **his is going to be a long string** unless you do your search in a case insensitive way. Depending on your uses for this, it may or may not be acceptable. – dlras2 Apr 23 '13 at 18:42
  • @DanRasmussen For the actual case that I plan on utilizing this, case sensitivity is not an issue, but thank you for the input! – Justin Apr 23 '13 at 19:01
  • 1
    " this is going to be a long string" has two matches and is longer than "this is going to be a long string" but the former only has two matches and the first one has three. What goes? – Jonas Elfström Apr 23 '13 at 19:04
  • What about overlapping sequences? Ones that are part of a longer sequence but also appears multiple times outside the longer sequence. – Jonas Elfström Apr 23 '13 at 19:06
  • @JonasElfström I will be storing the results into a list / array to perform a check if that "specific" sequence has already been located elsewhere. The larger the sequence the better, however, smaller sequences will be counted as well, but only if they are not already located within a larger sequence. Hope that clears things up! – Justin Apr 23 '13 at 19:11
  • It might but could you please apply that on: "Rudolph: I want a vombat. Adolph: I want a bonobo. Rudolph: I want a vombat too! Adolph: Then I don't want a bonobo." – Jonas Elfström Apr 23 '13 at 19:13
  • I am tempted to close this as not constructive; the topic is good for discussion but appears too ill-defined to generate any real answer. – dlras2 Apr 23 '13 at 19:16
  • @JonasElfström in the reference that you just gave... "Rudolph: I want a vombat" and "want a bonobo" would be the matching sequence for the algorithm that I am creating. – Justin Apr 23 '13 at 19:16
  • @DanRasmussen Give me a short bit, and I will see if I can't come up with some code for this. The issue currently being, I am at work at the moment, whereas my project is at home. I came to work looking for my solution. I will get a code reference up here shortly-- – Justin Apr 23 '13 at 19:18
  • And Justin put up some more `metrics` on what you're looking for - it's too wide like this. So is it string with words (I mentioned) or not, (char match). Also how long is the text possibly, how much of it you have in terms of mem or speed etc. – NSGaga-mostly-inactive Apr 23 '13 at 19:20
  • @NSGaga As stated a moment ago, I will completely re-vamp this question here shortly once I get home. The basic principle behind this will be searching a very large datatype as most likely char's rather than strings. For my reference, without having access to my code at home, I figured that the string reference would be the most simplistic form for an analogy. This will be characters, most likely, it will be binary read in as hexadecimal from a file. – Justin Apr 23 '13 at 19:25
  • @Justin I think maybe rewriting the question with more information and specific problems would be good, but perhaps start a new question? This one has too much discussion in the comments here that may be outdated. – dlras2 Apr 23 '13 at 20:12
  • @DanRasmussen Roger, kill it and I will rewrite at home! :) – Justin Apr 23 '13 at 20:20

2 Answers2

1

Here's the basic brute force idea

  • first you find all repeating sequences of size 1(you can change the minimum size to whatever you want).

To do this, you essentially go down the line, and use a regex to find all of the Ts and then all the hs, etc...

  • Then you find all sequences of size 2, so you'd find all the Ths and the his and the iss

  • you repeat this until you have found all of the sequences.

The runtime would be

  • the time complexity to find a particular sequence with regex: O(n)
  • times the number of different sequences of a particular size: O(n)
  • times the number of sizes: O(n)

the total time complexity would be O(n3)

  • Is this particular method likely to be fast? I am not familiar with regex, so I will research that! Is this going to go through the string char by char? Or will this look into finding the patterns completely on its own? – Justin Apr 23 '13 at 18:22
  • @Justin depends on your definition of fast. `O(n^3)`, can be taken to mean that the time it takes to run the operation can be proportional to the size of your input cubed. That's generally not considered to be very fast, but at least it's not NP hard – Sam I am says Reinstate Monica Apr 23 '13 at 18:25
  • @Justin, the speed will be greatly increased the larger you make the minimum length of the patterns you are looking for. I can assume you don't care about finding all the "t"s in a string. So, if you can start off knowing that you only care about patterns that are 8 chars or greater, you will have a much faster process as per SamIam's answer. – Scott Adams Apr 23 '13 at 18:29
  • Yes, thank you. I do fully intend to have the minimum sequence multiple characters long. My thoughts at this point are to have them roughly about 15-20 characters for the minimum length with possible room to play. In addition, does this work with only a minimum size? Meaning, if there is a sequence that exceeds my parameters for shortness, but is still matching well over my minimum, will this still work for that? – Justin Apr 23 '13 at 18:35
  • @Justin i do not understand your question, but if your minimum is 20, it will match all sequences of length greater than 20 – Sam I am says Reinstate Monica Apr 23 '13 at 18:36
  • @SamIam My apologies, I somehow misread your answer during that section. Thank you for that! – Justin Apr 23 '13 at 18:38
  • 1
    @Justin - Just to add in here (since already talk about boundaries) - I would recommend making it a `word boundary` actually - as it (based on your example) makes most sense to search for duplicates that are 'group of words' not at 'half of the word'. Given that you might be able to make a more performant algorithm (though not sure about Regex in that way, wasn't thinking strictly of that) – NSGaga-mostly-inactive Apr 23 '13 at 18:45
  • @NSGaga You are correct, searching for the entire word would make most sense, rather than using partial finds. This being said, I did modify my initial question to give a little bit more information. This algorithm is going to be used for compression, but the likelihood of having an entire word, is rather slim for the algorithm I have been contemplating... That is why I was mentioning rather often, "sequences" – Justin Apr 23 '13 at 19:14
0

Use a suffix tree to do this in O(n) time. I am adding this extraneous sentence to keep this from being converted into a comment.

Community
  • 1
  • 1
Mark Adler
  • 101,978
  • 13
  • 118
  • 158