
Is bag-of-words the same thing as a document-term matrix?

I have a training data set that consists of many files. I want to read all of them into a data structure (hash map?) to create a bag-of-words model for a particular class of documents, either science, religion, sports, or sex, in preparation for a perceptron implementation.

Right now I have the simplest of simple Java I/O constructs, i.e.

    String text; 
    BufferedReader br = new BufferedReader(new FileReader("file"));

    while ((text = br.readLine()) != null) 
    {
        //read in multiple files
        //generate a hash map with each unique word
        //as a key and the frequency with which that
        //word appears as the value
    }

So what I want to do is read input from multiple files in a directory and save all the data to one underlying structure. How can I do that? Should I write it out to a file somewhere?

I think a hash map, as described in the comments of the code above, would work, based on my understanding of bag-of-words. Is that right? How could I implement such a thing in sync with reading input from multiple files? How should I store it so I can later incorporate it into my perceptron algorithm?

I've seen this done like so:

    String[] names = new String[]{"a.txt", "b.txt", "c.txt"};
    StringBuffer strContent = new StringBuffer("");

    for (String name : names) {
        File file = new File(name);
        int ch;
        FileInputStream stream = null;
        try {
            stream = new FileInputStream(file);
            // Append the file's contents one byte at a time
            while ((ch = stream.read()) != -1) {
                strContent.append((char) ch);
            }
        } finally {
            // A failed open leaves stream null, so guard the close
            if (stream != null) {
                stream.close();
            }
        }
    }

But this is a lame solution because you need to specify all the files in advance. I think it should be more dynamic, if possible.

smatthewenglish
  • Bag of words refers to simply indexing and storing all the words that are present, with no model of the positions of the words relative to each other. A document-term matrix is just a structure for referencing which terms appear in which document; think of it like the index in a book. – EdChum Feb 16 '15 at 09:13
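
To make the distinction in the comment above concrete, here is a minimal sketch, assuming whitespace tokenization and lowercasing (neither is specified in the thread). The class and method names are hypothetical. Each document gets its own bag of words (a word-to-count map), and the document-term matrix is those bags laid out over a shared, sorted vocabulary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class for illustration; not from the original thread.
class DocumentTermMatrixSketch {

    // Bag of words for one document: word -> number of occurrences.
    // Assumption: words are separated by whitespace and compared in lowercase.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                bag.merge(word, 1, Integer::sum);
            }
        }
        return bag;
    }

    // Shared vocabulary: every distinct word across all bags, sorted.
    static List<String> vocabulary(List<Map<String, Integer>> bags) {
        TreeSet<String> words = new TreeSet<>();
        for (Map<String, Integer> bag : bags) {
            words.addAll(bag.keySet());
        }
        return new ArrayList<>(words);
    }

    // Document-term matrix: one row per document, one column per vocabulary word.
    static int[][] documentTermMatrix(List<Map<String, Integer>> bags, List<String> vocabulary) {
        int[][] matrix = new int[bags.size()][vocabulary.size()];
        for (int d = 0; d < bags.size(); d++) {
            for (int t = 0; t < vocabulary.size(); t++) {
                matrix[d][t] = bags.get(d).getOrDefault(vocabulary.get(t), 0);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> bags = Arrays.asList(
                bagOfWords("the cat sat"),
                bagOfWords("the the dog"));
        List<String> vocab = vocabulary(bags);
        System.out.println(vocab);
        for (int[] row : documentTermMatrix(bags, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Under these assumptions, each row of the matrix is one document's bag of words written out as a fixed-length vector, which is the shape a perceptron would typically take as input.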

2 Answers


You can try the program below. It's dynamic; you just need to provide your directory path.

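import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BagOfWords {

// Maps each file path to the set of distinct words found in that file
ConcurrentHashMap<String, Set<String>> map = new ConcurrentHashMap<String, Set<String>>();

public static void main(String[] args) throws IOException {
    File file = new File("F:/Downloads/Build/");
    new BagOfWords().iterateDirectory(file);
}

private void iterateDirectory(File file) throws IOException {
    File[] children = file.listFiles();
    if (children == null) {
        return; // not a directory, or it could not be read
    }
    for (File f : children) {
        if (f.isDirectory()) {
            // Recurse into the subdirectory f, not the parent
            iterateDirectory(f);
        } else {
            // Read File
            String text = new String(Files.readAllBytes(f.toPath()));
            // Split and put it in a set
            Set<String> words = new HashSet<String>(Arrays.asList(text.split("\\s+")));
            // add to map (keying by file path is an assumption)
            map.put(f.getPath(), words);
        }
    }
}

}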
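
As an alternative to hand-written recursion, the standard `java.nio.file.Files.walk` method (Java 8 and later) can collect every regular file under a directory. This is a minimal sketch, not part of the answer above; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class for illustration.
class FileLister {

    // Returns all regular files under root, including those in subdirectories,
    // sorted by path. The stream is closed by try-with-resources.
    static List<Path> regularFilesUnder(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

A caller could then loop over `FileLister.regularFilesUnder(Paths.get("some/dir"))` and feed each file into the word map, where `"some/dir"` stands in for the training directory.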

Saravana

I think this is very close, but there's some kind of discrepancy between int and Integer. How do I reconcile that?

// Word -> count. Declaring the value type as Integer lets int counts autobox.
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();

        public static void main(String[] args) throws IOException 
        {
            String path = "path";
            File file = new File(path);
            new BagOfWords().iterateDirectory(file);
        }

        private void iterateDirectory(File file) throws IOException 
        {
            for (File f : file.listFiles()) 
            {
                if (f.isDirectory()) 
                {
                    // Recurse into the subdirectory f, not the parent
                    iterateDirectory(f);
                } 
                else 
                {
                    String line;
                    // Open the current file f rather than the literal name "file"
                    BufferedReader br = new BufferedReader(new FileReader(f));
                    try
                    {
                        while ((line = br.readLine()) != null) 
                        {
                            String[] words = line.split(" "); // those are your words

                            for (int i = 0; i < words.length; i++) 
                            {
                                String word = words[i];
                                // Start unseen words at 0, then increment
                                if (!map.containsKey(word))
                                {
                                    map.put(word, 0);
                                }
                                map.put(word, map.get(word) + 1);
                            }
                        }
                    }
                    finally
                    {
                        br.close();
                    }
                }
            }
        }
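
On the int versus Integer point: Java autoboxes an `int` into an `Integer` when the map's value type is `Integer`, so a put like `map.put(word, 0)` compiles only if the map is declared with `Integer` values. The sketch below shows the same counting loop using the standard `Map.merge` method (Java 8 and later), which does the "start at 1 or add 1" step in one call. The class name and the whitespace tokenization are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class for illustration.
class WordCounter {

    // Word -> frequency, accumulated across every reader passed to addAll.
    private final Map<String, Integer> counts = new HashMap<>();

    // Reads all lines from the reader and adds their words to the running counts.
    // Assumption: words are separated by runs of whitespace.
    void addAll(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    // Inserts 1 for a new word, otherwise adds 1 to the old count
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
    }

    Map<String, Integer> counts() {
        return counts;
    }
}
```

Calling `addAll` once per file keeps a single map for the whole directory, which matches the "one underlying structure" asked for in the question.
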
smatthewenglish