I have this code that searches a document, collects its sentences in an ArrayList<StringBuffer>, and saves that object to a file:
public static void save(String doc_path) {
    StringBuffer text = new StringBuffer(new Corpus().createDocument(doc_path + ".txt").getDocStr());
    ArrayList<StringBuffer> lines = new ArrayList<>();
    // compile is a static import of java.util.regex.Pattern.compile
    Matcher matcher = compile("(?<=\n).*").matcher(text);
    while (matcher.find()) {
        String line_str = matcher.group();
        if (checkSentenceLine(line_str)) {
            lines.add(new StringBuffer(line_str));
        }
    }
    FilePersistence.save(lines, doc_path + ".lin");
    FilePersistence.save(lines.toString(), doc_path + "_extracoes.txt");
}
Corpus
public Document createDocument(String file_path) {
    File file = new File(file_path);
    if (file.isFile()) {
        return new Document(file);
    } else {
        Message.displayError("file path is not OK");
        return null;
    }
}
FilePersistence
public static void save(Object object_root, String file_path) {
    if (object_root == null) return;
    // try-with-resources closes the stream even if writeObject throws
    try (ObjectOutputStream output = new ObjectOutputStream(new FileOutputStream(file_path))) {
        output.writeObject(object_root);
    } catch (Exception exception) {
        System.out.println("Fail to save file: " + file_path + " --- " + exception);
    }
}

public static Object load(String file_path) {
    // try-with-resources closes the stream on every path
    try (ObjectInputStream input = new ObjectInputStream(new FileInputStream(file_path))) {
        return input.readObject();
    } catch (Exception exception) {
        System.out.println("Fail to load file: " + file_path + " --- " + exception);
        return null;
    }
}
The problem is that the document uses right single quotation marks (U+2019) as apostrophes. When I load it and print it on screen, I get odd squares instead of apostrophes in NetBeans, and Â' if I open the file in Notepad. This prevents me from handling the extracted sentences properly, or at least from displaying them properly. At first I thought it was due to an encoding incompatibility.
Then I tried changing the encoding in the project properties to CP1252, but that only changes the blank squares to question marks, and Notepad still shows the same Â'.
I also tried using
String line_str = matcher.group().replace("’","'")
and
String line_str = matcher.group().replace('\u2019', '\'');
but neither has any effect.
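To narrow down why the replace calls have no effect, one option is to print the code points that are actually in the string. This is only a diagnostic sketch: CodePointDump and dump are hypothetical names, not part of the code above. On a string that really contains U+2019 the replace works, so if the dump of a line read from the file shows no U+2019, the character was already decoded into something else before replace was called.

```java
public class CodePointDump {

    /** Returns the code points of s as "U+XXXX" tokens separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A literal with a real right single quotation mark (U+2019).
        String ok = "it\u2019s";
        System.out.println(dump(ok));                   // U+0069 U+0074 U+2019 U+0073
        System.out.println(ok.replace('\u2019', '\'')); // it's
    }
}
```

Calling dump(line_str) inside the while loop would show which characters the matcher is really returning.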
Update:
if (checkSentenceLine(line_str)) {
    System.out.println(line_str);
    lines.add(new StringBuffer(line_str));
}
This is before saving to the binary file, and the single quotes are already messed up: they show as blank squares in UTF-8 and as ? in CP1252. That makes me think the problem is in reading from the .txt file.
The weird thing is that if I do this:
System.out.println('\u2019');
it shows a perfect right single quote. The problem only occurs when reading from a .txt file, which makes me think it lies in the method I'm using to read the file. It also happens with bullet point symbols.
Maybe the problem is in converting the StringBuffer to a String? If so, how could I prevent this from happening?
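For comparison, here is a minimal sketch of reading a file with an explicit charset instead of the platform default. It assumes the .txt file is UTF-8 encoded, which I have not confirmed, and ExplicitCharsetRead, decode and readAll are hypothetical helpers, not the Document class used above (whose reading code is not shown). It also shows that decoding the same UTF-8 bytes as windows-1252 loses the U+2019 character, which would be consistent with a charset mismatch at read time rather than with the StringBuffer to String conversion.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {

    /** Decodes raw bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Reads a whole file and decodes it with the given charset, not the platform default. */
    static String readAll(Path path, Charset charset) {
        try {
            return decode(Files.readAllBytes(path), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 test file containing a right single quotation mark.
        Path tmp = Files.createTempFile("quote-test", ".txt");
        Files.write(tmp, "don\u2019t".getBytes(StandardCharsets.UTF_8));

        String asUtf8 = readAll(tmp, StandardCharsets.UTF_8);
        String asCp1252 = readAll(tmp, Charset.forName("windows-1252"));

        System.out.println(asUtf8.indexOf('\u2019') >= 0);   // true
        System.out.println(asCp1252.indexOf('\u2019') >= 0); // false

        Files.delete(tmp);
    }
}
```

If the file turns out to be in a different encoding, the same helper can be called with that charset instead of UTF-8.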