
I have the following code; it reads many files from a directory into a hash map, and that map is my feature vector. It's somewhat naive in the sense that it does no stemming, but that's not my primary concern right now. I want to know how I can use this data structure as the input to the perceptron algorithm. I guess we call this a bag of words, don't we?

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BagOfWords 
{
    static Map<String, Integer> bag_of_words = new HashMap<>();

    public static void main(String[] args) throws IOException 
    {
        String path = "/home/flavius/atheism";
        File file = new File( path );
        new BagOfWords().iterateDirectory(file);

        for (Map.Entry<String, Integer> entry : bag_of_words.entrySet()) 
        {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    private void iterateDirectory(File file) throws IOException 
    {
        for (File f : file.listFiles()) 
        {
            if (f.isDirectory()) 
            {
                //recurse into the subdirectory itself, not the parent again
                iterateDirectory(f);
            } 
            else 
            {
                String line; 
                BufferedReader br = new BufferedReader(new FileReader( f ));

                while ((line = br.readLine()) != null) 
                {
                    String[] words = line.split(" ");//those are your words

                    for (String word : words) 
                    {
                        if (!bag_of_words.containsKey(word))
                        {
                            bag_of_words.put(word, 0);
                        }
                        bag_of_words.put(word, bag_of_words.get(word) + 1);
                    }
                }
                br.close();
            }
        }
    }
}

You can see that the path goes to a directory called 'atheism'; there's also one called 'sports'. I want to try to linearly separate these two classes of documents, and then try to separate the unseen test docs into either category.

How do I do that? How do I conceptualize it? I'd appreciate a solid reference, a comprehensive explanation, or some kind of pseudocode.

I've not found many informative and lucid references on the web.

smatthewenglish
  • You need to vectorize your files (documents) into a vector representation, maybe you want to have a look at my vectorizer class: https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/nlp/VectorizerUtils.java#L256 – Thomas Jungblut Feb 16 '15 at 08:56
  • how is that different than the hash map I have now? – smatthewenglish Feb 16 '15 at 09:11
  • @ThomasJungblut I think that's what I'm already doing now, isn't it? – smatthewenglish Feb 16 '15 at 09:45
  • your `bag_of_words` is a dictionary, you need a bag of that per document (file) you parse. – Thomas Jungblut Feb 16 '15 at 10:21
  • ah, so right now I'm putting all of the files into one dictionary, but really what I should be doing is creating a separate one for each file? – smatthewenglish Feb 16 '15 at 10:30
  • You should create a global dictionary that gives you the dimension of your vectors, while you maintain a vector/document based "dictionary" that is the input to your perceptron. – Thomas Jungblut Feb 16 '15 at 10:58
  • can you write that in pseudocode? I've never done this before and I'm having trouble conceptualizing it. I think [this one](http://stackoverflow.com/questions/4663379/implementing-a-perceptron-classifier) might be helpful too – smatthewenglish Feb 16 '15 at 11:01
  • Let me know if you have any questions about my answer – Thomas Jungblut Feb 16 '15 at 11:46
  • @ThomasJungblut Herr Thomas, that's really great, thanks a million. How about that small clarification edit I just made? – smatthewenglish Feb 16 '15 at 12:26
  • I see, it got rejected, but I added the sensible equivalent. – Thomas Jungblut Feb 16 '15 at 13:06
  • like for atheism, I create a set of all the words from all the docs in atheism, and then I give each individual document within atheism a binary feature vector based on the number of words they share with all the words in atheism? or do I create a global dictionary for all of them together, one containing atheism, sports, politics, and science, and then score each individual document on the basis of that huge dictionary? – smatthewenglish Feb 16 '15 at 13:10
  • it's a global dictionary across all class labels. – Thomas Jungblut Feb 16 '15 at 13:21
  • could you check out [what I have so far](https://github.com/h1395010/bag_of_werds/tree/master/src/bag_of_werds) and let me know if you think I'm on the right track? I think I have the global dict down but I'm wondering what to do about those feature vectors. – smatthewenglish Feb 16 '15 at 13:29
  • `I'm wondering what to do about those feature vectors` what does that mean? Maybe you want to formulate a different question. – Thomas Jungblut Feb 16 '15 at 13:38

2 Answers


Let's establish some vocabulary up front (I guess you are using the 20 Newsgroups dataset):

  • "Class Label" is what you're trying to predict, in your binary case this is "atheism" vs. the rest
  • "Feature vector" that's what you input to your classifier
  • "Document" that is a single e-mail from the dataset
  • "Token" a fraction of a document, usually a unigram/bigram/trigram
  • "Dictionary" a set of "allowed" words for your vector

So the vectorization algorithm for bag of words usually follows these steps:

  1. Go over all the documents (across all class labels) and collect all the tokens; this set is your dictionary, and its size is the dimensionality of your feature vector
  2. Go over all the documents again and for each do:
    1. Create a new feature vector with the dimensionality of your dictionary (e.g. 200, for 200 entries in that dictionary)
    2. Go over all the tokens in that document and set the word count (within this document) at that token's dimension of the feature vector
  3. You now have a list of feature vectors that you can feed into your algorithm

Example:

Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]

Dictionary is:

["I", "am", "awesome", "great"]

So the documents as a vector would look like:

Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
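
To make those three steps concrete in code, here is a minimal, self-contained Java sketch that builds the dictionary and the two count vectors for exactly these toy documents (the class and variable names are illustrative, not from any code in this question):

import java.util.*;

public class BagOfWordsSketch
{
    public static void main(String[] args)
    {
        //the two toy documents from the example above
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("I", "am", "awesome"),
                Arrays.asList("I", "am", "great", "great"));

        //step 1: collect every token into the dictionary; a LinkedHashSet
        //keeps insertion order so the vector dimensions are stable
        Set<String> dictionary = new LinkedHashSet<>();
        for (List<String> doc : documents)
        {
            dictionary.addAll(doc);
        }
        List<String> dictList = new ArrayList<>(dictionary);

        //step 2: one count vector per document, one dimension per dictionary word
        for (List<String> doc : documents)
        {
            int[] vector = new int[dictList.size()];
            for (String token : doc)
            {
                vector[dictList.indexOf(token)]++;
            }
            System.out.println(Arrays.toString(vector)); //prints [1, 1, 1, 0] then [1, 1, 0, 2]
        }
    }
}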

And with that you can do all kinds of fancy math stuff and feed this into your perceptron.
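
For completeness, the "fancy math stuff" could look like the classic mistake-driven perceptron training rule. The following is a minimal sketch under stated assumptions: the +1/-1 labels, the learning rate of 1.0, and the fixed epoch count are all illustrative choices, not prescribed by anything above.

import java.util.Arrays;

public class PerceptronSketch
{
    public static void main(String[] args)
    {
        //the two toy feature vectors with made-up class labels:
        //+1 could stand for "atheism", -1 for the other class
        int[][] X = { {1, 1, 1, 0}, {1, 1, 0, 2} };
        int[] y = { 1, -1 };

        double[] w = new double[X[0].length]; //weights, zero-initialized
        double bias = 0.0;
        double learningRate = 1.0;            //illustrative choice

        for (int epoch = 0; epoch < 10; epoch++)
        {
            for (int i = 0; i < X.length; i++)
            {
                //activation = w . x + b
                double activation = bias;
                for (int j = 0; j < w.length; j++)
                {
                    activation += w[j] * X[i][j];
                }
                int predicted = activation >= 0 ? 1 : -1;

                //classic perceptron update: only on a mistake
                if (predicted != y[i])
                {
                    for (int j = 0; j < w.length; j++)
                    {
                        w[j] += learningRate * y[i] * X[i][j];
                    }
                    bias += learningRate * y[i];
                }
            }
        }
        System.out.println("weights: " + Arrays.toString(w) + ", bias: " + bias);
    }
}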

Thomas Jungblut

This is the full and complete answer to my original question, posted here for the benefit of future perusers.


Given the following files:

  • atheism/a_0.txt

    Gott ist tot.
    
  • politics/p_0.txt

    L'Etat, c'est moi , et aussi moi .
    
  • science/s_0.txt

    If I have seen further it is by standing on the shoulders of giants.
    
  • sports/s_1.txt

    You miss 100% of the shots you don't take.
    
  • Output data structures:

    /data/train/politics/p_0.txt, [0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    /data/train/science/s_0.txt, [1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
    /data/train/atheism/a_0.txt, [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    /data/train/sports/s_1.txt, [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1]
    

The code looks like this, or you can find it on my GitHub page.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FileDictCreateur 
{
    static String PATH = "/home/matthias/Workbench/SUTD/ISTD_50.570/assignments/practice_data/data/train";

    //the global set of all words across all articles
    static Set<String> GLOBO_DICT = new HashSet<String>();

    //has the global dictionary been fully populated?
    static boolean globo_dict_fixed = false;

    //hash map of all the words contained in individual files
    static Map<File, ArrayList<String>> fileDict = new HashMap<>();

    //input to the perceptron: the final data structure
    static Map<File, int[]> perceptron_input = new HashMap<>();


    public static void main(String[] args) throws IOException 
    {
        //each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports"};

        //cycle through all categories once to populate the global dict
        for(int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = PATH + categories[cycle];

            File directory = new File( general_data_partition );
            iterateDirectory( directory , globo_dict_fixed);

            if(cycle == 3)
                globo_dict_fixed = true;
        }


        //cycle through again to populate the file dicts
        for(int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = PATH + categories[cycle];

            File directory = new File( general_data_partition );
            iterateDirectory( directory , globo_dict_fixed);

        }



        perceptron_data_struc_generateur( GLOBO_DICT, fileDict, perceptron_input );



        //print the output
        for (Map.Entry<File, int[]> entry : perceptron_input.entrySet()) 
        {
            System.out.println(entry.getKey() + ", " + Arrays.toString(entry.getValue()));
        }
    }



    private static void iterateDirectory(File directory, boolean globo_dict_fixed) throws IOException 
    {
        for (File file : directory.listFiles()) 
        {
            if (file.isDirectory()) 
            {
                //recurse into the subdirectory itself, not the parent again
                iterateDirectory(file, globo_dict_fixed);
            } 
            else 
            {   
                String line; 
                //try-with-resources closes the reader even if an exception is thrown
                try (BufferedReader br = new BufferedReader(new FileReader( file )))
                {
                    while ((line = br.readLine()) != null) 
                    {
                        String[] words = line.split(" ");//those are your words

                        if (!globo_dict_fixed)
                        {
                            populate_globo_dict( words );
                        }
                        else
                        {
                            create_file_dict( file, words );
                        }
                    }
                }
            }
        }
    }

    public static void create_file_dict( File file, String[] words ) throws IOException
    {   
        //use proper generics so no @SuppressWarnings is needed, and append
        //every line's words (the original containsKey guard kept only the
        //first line of each file)
        if (!fileDict.containsKey(file))
        {
            fileDict.put(file, new ArrayList<String>());
        }
        fileDict.get(file).addAll(Arrays.asList(words));
    }

    public static void populate_globo_dict( String[] words ) throws IOException
    {
        //a Set already ignores duplicates, so no contains() guard is needed
        GLOBO_DICT.addAll(Arrays.asList(words));
    }

    public static void perceptron_data_struc_generateur(Set<String> GLOBO_DICT, 
                                                        Map<File, ArrayList<String>> fileDict,
                                                        Map<File, int[]> perceptron_input)
    {
        //for each file in fileDict, create a count vector with one dimension
        //per word in the global dictionary, incrementing the dimension that
        //corresponds to each word of the document; that vector is the
        //perceptron input

        //fix an ordering of the dictionary so each word has a stable index
        List<String> GLOBO_DICT_list = new ArrayList<>(GLOBO_DICT);

        for (Map.Entry<File, ArrayList<String>> entry : fileDict.entrySet()) 
        {
            //int arrays are zero-initialized in Java, so no explicit fill is needed
            int[] cross_czech = new int[GLOBO_DICT_list.size()];

            //walk the document's words once, looking up each word's index,
            //rather than scanning the whole dictionary for every word
            for (String word : entry.getValue()) 
            {
                int index = GLOBO_DICT_list.indexOf(word);
                if (index >= 0)
                {
                    cross_czech[index]++;
                }
            }
            perceptron_input.put( entry.getKey(), cross_czech );
        }
    }
}
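
One piece the code above leaves open is how each feature vector gets its class label for the perceptron. Here is a hedged sketch of one way to do it, assuming (as in this directory layout) that a file's parent directory name is its category; the +1/-1 encoding is an illustrative binary labeling of "atheism" vs. the rest:

//derive (feature vector, label) training pairs from perceptron_input;
//getParentFile().getName() yields e.g. "atheism" for /data/train/atheism/a_0.txt
for (Map.Entry<File, int[]> entry : perceptron_input.entrySet())
{
    int label = entry.getKey().getParentFile().getName().equals("atheism") ? 1 : -1;
    int[] featureVector = entry.getValue();
    //featureVector and label can now drive a mistake-driven perceptron
    //update like the one sketched under the first answer
}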
Chris Forrence
smatthewenglish
  • If you have a new question, please ask it by clicking the [Ask Question](http://stackoverflow.com/questions/ask) button. Include a link to this question if it helps provide context. – kenorb Feb 17 '15 at 14:23
  • yeah, but actually that really is the complete answer to my original question contained within there. it's sort of a two for one – smatthewenglish Feb 17 '15 at 14:29
  • @flavius_valens I'm going to go ahead and edit your answer then; this answer came up in the low quality queue, and by reading the first line, I can see it being very easy to dismiss as a new question. – Chris Forrence Feb 17 '15 at 17:27
  • @ChrisForrence do either of you know the answer to the question you were trying to eradicate? – smatthewenglish Feb 18 '15 at 04:52
  • @flavius_valens You'd need to ask Thomas that. It looks like you've already done the right thing: commenting on his answer. He should get a notification about that. As a note, mentioning people within an _answer_ doesn't notify anyone. Comments? Yes. – Chris Forrence Feb 18 '15 at 12:38