I am trying to create a TreeMap<String, List<CountPerDocument>>, i.e. a map from each word to a list of (docId, count) pairs.
The conditions are:
- If the word does not exist yet: insert it into the TreeMap and associate it with an ArrayList holding (docId, count).
- If the word is already present in the TreeMap: check whether the current docId is already in its ArrayList and, if so, increase the count; otherwise add the current docId with a count of 1.
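The two conditions can be sketched as follows. This only illustrates the intended behavior and is not the code in question: the `PostingSketch` class, the mutable `Posting` pair, and the `record` method are hypothetical names introduced for this sketch, and it assumes each word should own a separate list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingSketch {
    // Hypothetical mutable (docId, count) pair, used only in this sketch.
    static final class Posting {
        final String docId;
        int count;

        Posting(String docId, int count) {
            this.docId = docId;
            this.count = count;
        }

        @Override
        public String toString() {
            return docId + "-" + count;
        }
    }

    // Records one occurrence of `word` in document `docId`.
    static void record(Map<String, List<Posting>> index, String word, String docId) {
        // Condition 1: a word seen for the first time gets its own new list.
        List<Posting> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
        // Condition 2: if the document is already listed for this word, bump its count.
        for (Posting p : postings) {
            if (p.docId.equals(docId)) {
                p.count++;
                return;
            }
        }
        // Otherwise start a new (docId, 1) entry for this word.
        postings.add(new Posting(docId, 1));
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> index = new TreeMap<>();
        record(index, "attempt", "LA010190-0001");
        record(index, "attempt", "LA010190-0001");
        record(index, "attempt", "LA010190-0002");
        record(index, "todai", "LA010190-0002");
        // prints {attempt=[LA010190-0001-2, LA010190-0002-1], todai=[LA010190-0002-1]}
        System.out.println(index);
    }
}
```

With this shape, each key ends up with at most one entry per document, which is the result the conditions describe.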
Below is the code I am using.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StemTreeMap
{
    private static final String r1 = "\\$DOC";
    private static final String r2 = "\\$TITLE";
    private static final String r3 = "\\$TEXT";
    private static Pattern p1,p2,p3;
    private static Matcher m1,m2,m3;

    public static void main(String[] args)
    {
        BufferedReader rd,rd1;
        String docid = null;
        String id;
        int tf = 0;
        //CountPerDocument cp = new CountPerDocument(docid, count);
        List<CountPerDocument> ls = new ArrayList<>();
        Map<String,List<CountPerDocument>> mp = new TreeMap<>();
        try
        {
            rd = new BufferedReader(new FileReader(args[0]));
            rd1= new BufferedReader(new FileReader(args[0]));
            int docCount = 0;
            String line = rd.readLine();
            p1 = Pattern.compile(r1);
            p2 = Pattern.compile(r2);
            p3 = Pattern.compile(r3);
            while(line != null)
            {
                m1 = p1.matcher(line);
                m2 = p2.matcher(line);
                m3 = p3.matcher(line);
                if(m1.find())
                {
                    docid = line.substring(5, line.length());
                    docCount++;
                    //System.out.println("The Document ID is :");
                    //System.out.println(docid);
                    line = rd.readLine();
                }
                else if(m2.find()||m3.find())
                {
                    line = rd.readLine();
                }
                else
                {
                    if(!(mp.containsKey(line))) // if the stem is not on the TreeMap
                    {
                        //System.out.println("The stem is not present in the tree");
                        //System.out.println("The stem is not present in the tree: " + line + " The Document is :" + docid);
                        tf = 1;
                        ls.add(new CountPerDocument(docid,tf));
                        mp.put(line, ls);
                        System.out.println("Inserted string is: "+ mp.get(line));
                        line = rd.readLine();
                    }
                    else
                    {
                        if(ls.indexOf(docid) > 0) //if its last entry matches the current document number
                        {
                            //System.out.println("The Stem is present for the same docid so incrementing docid: " +line + ":"+ docid);
                            tf = tf+1;
                            ls.add(new CountPerDocument(docid,tf));
                            line = rd.readLine();
                        }
                        else
                        {
                            //System.out.println("Stem is present but not the same docid so inserting new docid: "+line + ":"+ docid);
                            tf = 1;
                            ls.add(new CountPerDocument(docid,tf)); //set did to the current document number and tf to 1
                            line = rd.readLine();
                        }
                    }
                }
            }
            rd.close();
            System.out.println("The Number of Documents in the file is:"+ docCount);
            //Write to an output file
            String l = rd1.readLine();
            File f = new File("dictionary.txt");
            if (f.createNewFile())
            {
                System.out.println("File created: " + f.getName());
            }
            else
            {
                System.out.println("File already exists.");
                Path path = Paths.get("dictionary.txt");
                Files.deleteIfExists(path);
                System.out.println("Deleted Existing File:: Creating New File");
                f.createNewFile();
            }
            FileWriter fw = new FileWriter("dictionary.txt");
            fw.write("The Total Number of Stems: " + mp.size() +"\n");
            /*Set<Map.Entry<String,List<CountPerDocument>>> entries = mp.entrySet();
            for(Map.Entry<String,List<CountPerDocument>> entry : entries)
            {
                fw.write(entry.getKey() + entry.getValue());
            } */
            Iterator<Map.Entry<String, List<CountPerDocument>>> iterator = mp.entrySet().iterator();
            Map.Entry<String, List<CountPerDocument>> entry = null;
            while(iterator.hasNext())
            {
                entry = iterator.next();
                fw.write(entry.getKey() + "=>" + entry.getValue() + "\n" );
            }
            //System.out.println(mp.get("todai"));
            fw.close();
        }catch(IOException e)
        {
            e.printStackTrace();
        }
    }
}
The elements of the ArrayList are instances of this class:
public class CountPerDocument
{
    private final String documentId;
    private final int count;

    CountPerDocument(String documentId, int count)
    {
        this.documentId = documentId;
        this.count = count;
    }

    public String getDocumentId()
    {
        return this.documentId;
    }

    public int getCount()
    {
        return this.count;
    }

    @Override
    public String toString()
    {
        return this.documentId + "-" + this.count;
    }
}
When I print what I am inserting into the map with mp.get(line), I get the following output:
Stem is: attempt
DocId is: LA010190-0002TF is : 1
Inserted string is: [LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0001-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, 
LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, 
LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, 
LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, 
LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1, LA010190-0002-1]
I'm not sure why so many are being inserted. Am I printing the output wrong, or is there something wrong with the method that I chose?
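One detail that may be relevant: `ls` is created once, before the loop, and that same instance is both appended to on every stem line and passed to every `mp.put` call. A reduced sketch of that pattern is below. The class name `SharedListDemo`, the `build` method, and the sample stems are hypothetical, and the sketch only assumes that the single shared list is the part worth isolating.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SharedListDemo {
    // Mimics the loop above in miniature: one list, many keys.
    static Map<String, List<String>> build() {
        List<String> shared = new ArrayList<>(); // created once, like `ls`
        Map<String, List<String>> mp = new TreeMap<>();
        String[] stems = {"attempt", "todai", "attempt"}; // hypothetical input lines
        for (String stem : stems) {
            shared.add("LA010190-0001-1");   // every stem line appends to the same list
            if (!mp.containsKey(stem)) {
                mp.put(stem, shared);        // every key stores a reference to that list
            }
        }
        return mp;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mp = build();
        // Both keys refer to one list object, so each shows all three entries.
        System.out.println(mp.get("attempt") == mp.get("todai")); // prints true
        System.out.println(mp.get("todai").size());               // prints 3
    }
}
```

If this is what is happening in the full program, the list printed for any stem would contain one entry for every stem line read so far, regardless of which word the line belongs to.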