Searching words in array

Question

Possible Duplicate:
How to find list of possible words from a letter matrix [Boggle Solver]

I have a String[][] array such as

h,b,c,d
e,e,g,h
i,l,k,l
m,l,o,p

I need to match an ArrayList against this array to find the words specified in the ArrayList. When searching for word hello, I need to get a positive match and the locations of the letters, for example in this case (0,0), (1,1), (2,1), (3,1) and (3,2).

When going letter by letter and we suppose we are successfully located the first l letter, the program should try to find the next letter (l) in the places next to it. So it should match against e, e, g, k, o, l, m and i meaning all the letters around it: horizontally, vertically and diagonally. The same position can't be found in the word twice, so (0,0), (1,1), (2,1), (2,1) and (3,2) wouldn't be acceptable because the position (2,1) is matched twice. In this case, both will match the word because diagonally location is allowed but it needs to match the another l due to the requirement that a position can not be used more than once.

This case should also be matched

h,b,c,d
e,e,g,h
l,l,k,l
m,o,f,p

If we suppose that we try to search for helllo, it won't match. Either (x1, y1) (x1, y1) or (x1, y1) (x2, y2) (x1, y1) can't be matched.

What I want to know what is the best way to implement this kind of feature. If I have 4x4 String[][] array and 100 000 words in an ArrayList, what is the most efficient and the easiest way to do this?

I've tried to do as I described the program should work: locating the first letter, matching the letters around it and repeating the process. The problems occur when there are multiple same letters around it. What comes to the second question, it can start at any cell. The length of the words is from 3 to 10 letters (most common words in my language). Also, I wanted to discuss on the best solutions, not how to fix my slow and buggy version. I think the best solution will be something similar though. — MikkoP, Nov 10 '12 at 22:03
If I understood correctly you cannot go back, i.e. after reading h at (0,0) you cannot read it again in the same search cycle. In this particular case (small number of letters, large number of words) I would create the list of all possible combinations (by saving not only the letter, but the whole path) using for example depth first search (since it will require not that much memory to save your intermediate steps). But response really depends a lot on conditions of problem, if array will get much bigger then 4x4, this solution will have serious problems. — Serhiy, Nov 10 '12 at 22:10
Yes, a coordinate can be used only once, so (0,0) can not be accessed multiple times. — MikkoP, Nov 10 '12 at 22:18

Markus A. · Answer 1 · 2014-05-20T22:16:02.743

I think you will probably spend most of your time trying to match words that can't possibly be built by your grid. So, the first thing I would do is try to speed up that step and that should get you most of the way there.

I would re-express the grid as a table of possible moves that you index by the letter. Start by assigning each letter a number (usually A=0, B=1, C=2, ... and so forth). For your example, let's just use the alphabet of the letters you have (in the second grid where the last row reads " m o f p "):

 b | c | d | e | f | g | h | k | l | m |  o |  p
---+---+---+---+---+---+---+---+---+---+----+----
 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

Then you make a 2D boolean array that tells whether you have a certain letter transition available:

     |  0  1  2  3  4  5  6  7  8  9 10 11  <- from letter
     |  b  c  d  e  f  g  h  k  l  m  o  p
-----+--------------------------------------
 0 b |     T     T     T  T     
 1 c |  T     T  T     T  T
 2 d |     T           T  T
 3 e |  T  T     T     T  T  T  T
 4 f |                       T  T     T  T
 5 g |  T  T  T  T        T  T  T
 6 h |  T  T  T  T     T     T  T
 7 k |           T  T  T  T     T     T  T
 8 l |           T  T  T  T  T  T  T  T  T
 9 m |                          T     T
10 o |              T        T  T  T
11 p |              T        T  T
 ^
 to letter

Now go through your word list and convert the words to transitions (you can pre-compute this):

hello (6, 3, 8, 8, 10):
6 -> 3, 3 -> 8, 8 -> 8, 8 -> 10

Then check if these transitions are allowed by looking them up in your table:

[6][3] : T
[3][8] : T
[8][8] : T
[8][10] : T

If they are all allowed, there's a chance that this word might be found.

For example the word "helmet" can be ruled out on the 4th transition (m to e: helMEt), since that entry in your table is false.

And the word hamster can be ruled out, since the first (h to a) transition is not allowed (doesn't even exist in your table).

Now, for the remaining words that you didn't eliminate, try to actually find them in the grid the way you're doing it now or as suggested in some of the other answers here. This is to avoid false positives that result from jumps between identical letters in your grid. For example the word "help" is allowed by the table, but not by the grid

Let me know when your boggle phone-app is done! ;)

score 0 · Answer 2 · edited May 23 '17 at 12:33

0

One approach to investigate is to generate all the possible sequences of letters (strings) from the grid, then check if each word exists in this set of strings, rather than checking each word against the grid. E.g. starting at h in your first grid:

h
hb
he
he // duplicate, but different path
hbc
hbg
hbe
hbe // ditto
heb
hec
heg
...

This is only likely to be faster for very large lists of words because of the overhead of generating the sequences. For small lists of words it would be much faster to test them individually against the grid.

You would either need to store the entire path (including coordinates) or have a separate step to work out the path for the words that match. Which is faster will depend on the hit rate (i.e. what proportion of input words you actually find in the grid).

Depending on what you need to achieve, you could perhaps compare the sequences against a list of dictionary words to eliminate the non-words before beginning the matching.

Update 2 in the linked question there are several working, fast solutions that generate sequences from the grid, deepening recursively to generate longer sequences. However, they test these against a Trie generated from the words list, which enables them to abandon a subtree of sequences early - this prunes the search and improves efficiency greatly. This has a similar effect to the transition filtering suggested by Markus.

edited May 23 '17 at 12:33

Community

1
1

answered Nov 10 '12 at 22:09

DNA

42,007
12
107
146

1

If the grid is 4x4 (16 possible starting points), diagonal moves are allowed (on average 4.25 choices for next letter), and the longest word has 10 letters, there will be around 7 million possible combinations (16 x 4.25^9)... So it should definitely be faster to go through the list of Strings and trying to match them in the grid. – Markus A. Nov 10 '12 at 22:19
How many combinations do you need to test in that case, for 100,000 input words? My maths is weak at this time of night, but would 100000 * 16 * 4.25^4 be about right? (500 million) assuming an average word length of 5 – DNA Nov 10 '12 at 22:31
Definitely much less, since you would rule out options letter by letter rather than following all possible paths, whereas in your idea, you would indeed have to check all 7 million possibilities against each of the 100000 words. – Markus A. Nov 10 '12 at 22:33
Good point, it would be a lot fewer as you could eliminate some words as you proceed. But no, you don't have to check 7M * 100K (not by brute force) - you'd do this using an O(1) HashSet lookup, or similar. Once you've generated the 7M possibilities, checking the 100K words against them is very fast. I've updated my answer to suggest a hybrid method. – DNA Nov 10 '12 at 22:42

score 0 · Answer 3 · answered Nov 10 '12 at 22:14

Although I am sure there is a beatiful and efficient answer for this question academically, you can use the same approach, but with a list possibilities. so, for the word 'hello', when you find the letter 'h', next you will add possible 'e' letters and so on. Every possibility will form a path of letters.

score 0 · Answer 4 · answered Nov 10 '12 at 23:06

I would start by thinking of your grid as a graph, where each grid position is a node and each node connect to its eight neighbors (you shouldn't need to explicitly code it as a graph in code, however). Once you find the potential starting letters, all you need to do is to do a depth first search of the graph from each start position. The key is to remember where you've already searched, so you don't end up making more work for yourself (or worse, get stuck in a cycle).

Depending on the size of character space being used, you might also be able to benefit from building lookup tables. Let's assume English (26 contiguous character codepoints); if you start by building a 26-element List<Point>[] array, you can populate that array once from your grid, and then can quickly get a list of locations to start your search for any word. For example, to get the locations of h I would write arr['h'-'a']

You can even leverage this further if you apply the same strategy and build lookup tables for each edge list in the graph. Instead of having to search all 8 edges for each node, you already know which edges to search (if any).

(One note - if your character space is non-contiguous, you can still do a lookup table, but you'll need to use a HashMap<Character,List<Point>> and map.get('h') instead.)

Searching words in array

4 Answers4