I was tasked with writing a program that opens a text file, searches it for occurrences of a user-supplied string, and reports how many there were.

What I have so far is below. It finds word fragments, which is good, but the professor wants it to find bizarre fragments that include spaces and everything: something like "of my" or "even g", or any other arbitrary string of characters.

My working code is below. I've been trying to make compareTo work, but I can't seem to get the syntax down. This professor insists on not being helpful, and it's a summer class, so there are no TAs to help. I've googled to no avail; I can't seem to put the problem into a decent set of words to search for.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

import javax.swing.*;

public class TextSearchFromFile 
{
public static void main(String[] args) throws FileNotFoundException 
{

    boolean run = true;
    int count = 0;


            //greet user
        JOptionPane.showMessageDialog(null, 
                "Hello, today you will be searching through a text file on the harddrive. \n"
                + "The Text File is a 300 page fantasy manuscript written by: Adam\n"
                + "This exercise was intended to have the user enter the file, but since \n"
                + "you, the user, don't know which file the text to search is that is a \n"
                + "bit difficult.\n\n"
                + "On the next window you will be prompted to enter a string of characters.\n"
                + "Feel free to enter that string and see if it is somewhere in 300 pages\n"
                + "and 102,133 words. Have fun.", 
                "Text Search", 
                JOptionPane.PLAIN_MESSAGE);

    while (run)
    {
        try
        {
                //open the file
            Scanner scanner = new Scanner(new File("An Everthrone Tale 1.txt"));

                //prompt user for word
            CharSequence findWord = JOptionPane.showInputDialog(null, 
                    "Enter the word to search for:", 
                    "Text Search", 
                    JOptionPane.PLAIN_MESSAGE);
            count = 0;


            while (scanner.hasNext())
            {

                if ((scanner.next()).contains(findWord))
                {
                    count++;
                }

            } //end search loop
            scanner.close(); //release the file before the next search


                //output results to user
            JOptionPane.showMessageDialog(null, 
                    "The results of your search are as follows: \n"
                    + "Your String: " + findWord + "\n"
                    + "Was found: " + count + " times.\n"
                    + "Within the file: An Ever Throne Tale 1.txt", 
                    "Text Search",
                    JOptionPane.PLAIN_MESSAGE);
        } //end try
        catch (NullPointerException e)
        {
            JOptionPane.showMessageDialog(null, 
                    "Thank you for using the Text Search.", 
                    "Text Search", 
                    JOptionPane.ERROR_MESSAGE);
            System.exit(0);
        }
    } //end run loop
} // end main
} // end class

I'm just at a loss as to how to make it search for crazy arbitrary pieces like that. He knows what's in the text file, so he knows he can put together sequences like my examples above that occur in the text, but they are not being found.

Adam
  • Do you mean search for word sequences like "of my" where the words are not on the same line? Like, where "of" is the last word on one line, and "my" is the first word on the next line? Because I can't see any reason why the code you've shown wouldn't work for strings that have spaces in them. – David Conrad Jul 24 '14 at 19:33
  • Is it possible to convert the entire text file into bytes, then convert the user's "findWord" into bytes and look for places where the series of bytes are equal in the code? – Adam Jul 25 '14 at 20:13
  • Just trying to see if that is a possible approach. Not that I know how to do that. – Adam Jul 25 '14 at 20:13
  • Ah, I didn't notice before, and I never use `Scanner`, but you're calling `next()` which just returns the next token (the next word, essentially). You should use `hasNextLine()` and `nextLine()`. That way, you could at least find matches on a single line. – David Conrad Jul 25 '14 at 20:18
  • I gave that a try yesterday during my testing and it still won't find the occurrences I know are there. – Adam Jul 25 '14 at 20:28
  • You need to carefully check exactly what is in `findWord`. Is it possible that the user is entering a word with a control character or other whitespace in it? Make sure it doesn't contain tabs, carriage returns, or line feeds. – David Conrad Jul 25 '14 at 20:51

2 Answers

Don't use hasNext() and next() since those will only return a single token at a time from the input file, and you won't be able to find a multi-word phrase (or anything containing spaces). If you use hasNextLine() and nextLine() you can do a little better, but it still won't find cases where "of my" appears with "of" as the last word on one line, and "my" as the first word on the next line. To find that, you need a little more context.
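The difference between token-by-token and line-by-line reading can be seen in a small sketch. The class name `TokenDemo` and the sample text are made up for illustration, and the `Scanner` reads from an in-memory string instead of the manuscript file:

```java
import java.util.Scanner;

class TokenDemo {
    // Counts tokens (as returned by Scanner.next()) that contain the phrase.
    static int countByToken(String text, String phrase) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                if (scanner.next().contains(phrase)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Counts lines (as returned by Scanner.nextLine()) that contain the phrase.
    static int countByLine(String text, String phrase) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                if (scanner.nextLine().contains(phrase)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "of my" appears once on a line and once split across lines.
        String sample = "the last of my kind\nnone of\nmy friends";
        System.out.println(countByToken(sample, "of my")); // 0: no token contains a space
        System.out.println(countByLine(sample, "of my"));  // 1: the split occurrence is missed
    }
}
```

Tokens never contain whitespace, so a phrase with a space in it cannot match under `next()`. Reading by line finds the first occurrence but still misses the one split across two lines.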

If you keep track of the last line read from the file, you can create a two-line buffer and find instances that are spread across multiple lines.

String last = ""; // initially, last is empty

while (scanner.hasNextLine())
{

    String line = scanner.nextLine();
    String text = last + " " + line; // two-line buffer

    if (text.contains(findWord))
    {
        count++;
    }

    last = line; // remember the last line read

} //end search loop

This should find phrases broken across two lines, but there are still three problems. First, you could have a phrase like "three lines long" that is broken across three lines:

  three
  lines
  long

You would need to extend the two-line buffer concept to find this. Ultimately, you might need to have the entire file in memory at once, but I suspect that is enough of an edge case that you probably don't care about it.

Second, when a phrase sits entirely on a single line, you will count it twice: once when that line is the one just read, and again on the next iteration, when the same line has become last.

Third, using contains in this way won't find multiple copies of the same word on the same line. So if you are looking for "dog" and the following text appears:

  My dog saw a dog today at the dog park which was full of dogs.

The test with contains will only cause count to be incremented once. (But it would happen again when this line was in last.)
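As a rough illustration of the third problem, the sketch below compares a single `contains` test with a loop over `indexOf` on the sentence above. The class and method names are hypothetical:

```java
class OccurrenceDemo {
    // Counts every occurrence of needle in line, including overlapping ones.
    static int countOccurrences(String line, String needle) {
        if (needle.isEmpty()) {
            return 0; // an empty needle would otherwise match at every index
        }
        int count = 0;
        int index = line.indexOf(needle);
        while (index != -1) {
            count++;
            index = line.indexOf(needle, index + 1); // resume one character later
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "My dog saw a dog today at the dog park which was full of dogs.";
        System.out.println(line.contains("dog") ? 1 : 0);  // 1: contains is only yes or no
        System.out.println(countOccurrences(line, "dog")); // 4: "dogs" counts as a fragment match
    }
}
```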

So I think you really need to 1. Read the entire file into a buffer, to find phrases split across any number of lines, and 2. Search that buffer using indexOf with an offset that advances after each match, until no more matches are found.

String text = "";

if (scanner.hasNextLine())
{
    text += scanner.nextLine(); // first line
}
while (scanner.hasNextLine())
{
    text += " "; // separate lines with a space
    text += scanner.nextLine();
}

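int found, offset = 0; // start looking at the beginning, offset 0
// indexOf needs a String; findWord is declared as a CharSequence in the question
while ((found = text.indexOf(findWord.toString(), offset)) != -1)
{
    count++; // found a match
    offset = found + 1; // resume one character past the start of this match
}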

If you don't care about matches broken across multiple lines, then you can do it one line at a time and avoid the memory cost of having the entire text in memory at once.
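Repeated `+=` on a `String` copies the whole text on every line, which can get slow on a 300-page file. One possible alternative, sketched here under the assumptions that Java 8 or later is available and that the file is valid UTF-8, reads all lines at once and joins them in a single pass. The class and method names are illustrative only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class WholeFileSearch {
    // Joins the lines with single spaces and counts every occurrence of phrase.
    static int countInLines(List<String> lines, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // avoid looping forever on an empty search string
        }
        String text = String.join(" ", lines); // built once, not once per line
        int count = 0;
        int offset = 0;
        int found;
        while ((found = text.indexOf(phrase, offset)) != -1) {
            count++;
            offset = found + 1;
        }
        return count;
    }

    // Reads the file as UTF-8 (an assumption about the file's encoding).
    static int countInFile(Path file, String phrase) throws IOException {
        return countInLines(Files.readAllLines(file), phrase);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java WholeFileSearch <file> <phrase>");
            return;
        }
        System.out.println(countInFile(Paths.get(args[0]), args[1]));
    }
}
```

If the file was saved in another encoding, `Files.readAllLines` may throw, and the two-argument overload that takes a `Charset` would be needed instead.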

David Conrad
  • What you're saying makes sense, and we haven't covered buffers so it's not even a tool in my toolbox yet, but thank you. But I tried and there's no joy. Still not finding multiword strings or fragments. – Adam Jul 26 '14 at 01:18
  • Unfortunately the word fragments are an absolute must. – Adam Jul 26 '14 at 01:24
  • You need to check what is actually in `findWord`. I tried this on my machine with text that contained "of my" seven times (including one split across lines), and it worked. Are you allowed to use string concatenation? You can use that instead of StringBuilder. I'll edit. – David Conrad Jul 26 '14 at 01:25
  • I debugged and watched what fills findWord and it is exactly whatever is entered. If I enter: of my and step through I can see findWord filled with: of my --- I have tried other multiword strings and some are found and some are not. Like: For mom was found the 1 time I know it appears in the text however: and dad was searched separately and was not found although it appears in the same line as For mom. – Adam Jul 26 '14 at 01:35
  • Maybe check that the file only contains regular spaces, and not non-breaking spaces or something? I dunno, I'm stumped. – David Conrad Jul 26 '14 at 01:38
  • Those same results above were with your code and my original. Hit or miss on which it finds. – Adam Jul 26 '14 at 01:40
  • Holy crap! David. The entire thing is a manuscript. 99.9% of it is double spaced between the words. Never even thought of it until you said regular spaces. I feel so dumb and am so sorry now. – Adam Jul 26 '14 at 01:46
  • Thank you very much David. Even my original works (with the double counting but I'm not so worried about that) considering the double spacing between words. – Adam Jul 26 '14 at 01:55
  • Ha! That's always tricky. You think you're looking at one thing but whitespace is so ... invisible. Glad you figured it out. – David Conrad Jul 26 '14 at 02:00
  • I'm going to use your code above. It takes my older netbook a while to run through, but the count is a lot more accurate. Thanks again! I've been beating myself up for a couple days on this, I really felt what I had done originally should've worked, but of course it would be something so right in front of my face. – Adam Jul 26 '14 at 02:06
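
Since the misses in this thread came down to double spaces between words, one way to make the search tolerant of that is to collapse every run of whitespace to a single space in both the text and the search string before counting. The following is only a sketch; `WhitespaceNormalizer` and its methods are hypothetical names, and treating a non-breaking space like an ordinary space is an assumption about what the file might contain:

```java
class WhitespaceNormalizer {
    // Collapses each run of whitespace (including U+00A0) to one ordinary space.
    static String normalize(String s) {
        return s.replaceAll("[\\s\\u00A0]+", " ");
    }

    // Counts occurrences of phrase in text after normalizing both.
    static int countNormalized(String text, String phrase) {
        String haystack = normalize(text);
        String needle = normalize(phrase);
        if (needle.isEmpty()) {
            return 0; // nothing to search for
        }
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index != -1) {
            count++;
            index = haystack.indexOf(needle, index + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        String doubleSpaced = "For  mom  and  dad"; // two spaces between the words
        System.out.println(doubleSpaced.contains("and dad"));         // false
        System.out.println(countNormalized(doubleSpaced, "and dad")); // 1
    }
}
```

Leading and trailing spaces in the search string are kept on purpose, so that a query such as " my" can still be used to exclude matches inside longer words.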

Do something along the lines of:

  1. Get the string as it is. Don't split it by space or anything.
  2. Use indexOf on the string. Once a match has been found, start the next search just after the place where it was found.

    int index = word.indexOf(guess);
    while (index >= 0) {
        System.out.println(index);
        index = word.indexOf(guess, index + 1);
    }

Indexes of all occurrences of character in a string

ND27
  • In my text file there is a string of characters that reads "of my" within a sentence. If I put: of my into my search field for the findWord, it will return zero on the counter. It will not find the instances I know are there. In my text file there is a string where the sentence reads "even gods", and if I put: even g (a word, a space, and a fragment) into my search field, it will return zero on the counter. However, if I put a regular single word or character in, it will find them. It will even find fragments like "th" or "en". – Adam Jul 24 '14 at 20:04
  • Make sure you don't split the words in your search field on anything. Ignore case: ["Of My"]. In your main string, search for the whole search string; it will take "of my" as one string and won't care whether there is a space or not. – ND27 Jul 28 '14 at 17:48
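
If ignoring case is wanted, as the last comment suggests, one simple approach is to lower-case both the text and the search string before running the indexOf loop. This is a sketch with made-up names, and using `Locale.ROOT` is an assumption that locale-independent lower-casing is good enough for the manuscript:

```java
import java.util.Locale;

class IgnoreCaseCount {
    // Counts occurrences of phrase in text, ignoring letter case.
    static int countIgnoreCase(String text, String phrase) {
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = phrase.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return 0; // nothing to search for
        }
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            count++;
            index = haystack.indexOf(needle, index + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample text with three differently cased matches.
        System.out.println(countIgnoreCase("Of my kin, of MY blood, of my", "of my")); // 3
    }
}
```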