0

Let's say I have a large text file (few MB to GB) of random text, consisting only of lowercase letters, no spaces. However, someone appends a string somewhere in the middle of it (consisting only of lowercase letters, no spaces) of words from the English language.

How would I go about finding where that string is and what it says, given that I do not know what the string is supposed to say (only that it's in English, and not completely random text)? I can use a dictionary of English words.

  • 1
    hsctf is a is tough buddy – rush2sk8 May 19 '14 at 01:08
  • The thing is that without a very distinguishable word to work from it's going to be difficult to tell actual English from noise, especially for shorter words... – awksp May 19 '14 at 01:11
  • the file is 10mb and there are no spaces – rush2sk8 May 19 '14 at 01:12
  • There is a brute force algorithm. Say that the longest word in your given English dictionary is of length `L`. You would then loop over each character in the file, and iteratively look at strings of length `1` to `L` with each given character that you're on as the starting point of that string. You would then compare those strings to every word in the dictionary to see if you have a match. This has some problems: words like `a` and `an` are likely to appear *a lot*, and this would take **forever** given the file and dictionary size (i.e. don't use this algorithm). It also uses a ton of loops. – ajp15243 May 19 '14 at 01:13
  • the Word to find is contained on the same line? or can it be splited on to the next line? – Typo May 19 '14 at 01:28
  • @boolean Seems to me like there's no whitespace at all in the file – awksp May 19 '14 at 01:29
  • @user3580294 carriage return is not a White space – Typo May 19 '14 at 01:32
  • @boolean Java regex (and Unicode) says otherwise, at the very least; don't see any reason why they shouldn't be considered whitespace – awksp May 19 '14 at 01:35
  • @user3580294 an ascii table says otherwise – Typo May 19 '14 at 01:46
  • 1
    @boolean And how so? Whitespace includes more than just the space character... And I'm fairly certain that ASCII doesn't define what "whitespace" is – awksp May 19 '14 at 01:47
  • 1
    I think the point of the problem is that there are only lower cases characters, nothing else... – Matthieu May 19 '14 at 01:48
  • @user3580294 new line = 12, carriage return = 15, Space = 32 – Typo May 19 '14 at 01:52
  • @boolean And tab = 9. Tabs are whitespace. And so are newlines/carriage returns. Again, whitespace includes more than just the space character. – awksp May 19 '14 at 01:53
  • @user3580294 like I said, it is not the same – Typo May 19 '14 at 01:59
  • @boolean You seem to be missing my point. I'm not saying that newlines are the same character as spaces or something like that. I'm saying that the term "whitespace" *includes* all of those characters (and more). Look [here](https://en.wikipedia.org/wiki/Whitespace_character#Unicode) if you want. – awksp May 19 '14 at 02:01
  • @user3580294 the "term whitespace" doesn't apply to character computation. – Typo May 19 '14 at 02:02

1 Answers1

0

Build the dictionary into a trie and traverse the file. O(n) time in size of file (I believe O(file size * trie depth) in worst case) and O(1) memory (fixing the size of the dictionary and assuming small largest word). This is also streamable and is very RAM-efficient, so would scale to, say, a terabyte of data with only a gigabyte of RAM.

djechlin
  • 59,258
  • 35
  • 162
  • 290