I have this code that searches a document, collects its sentences in an ArrayList<StringBuffer>, and saves that object to a file:

public static void save(String doc_path) {
    StringBuffer text  = new StringBuffer(new Corpus().createDocument(doc_path + ".txt").getDocStr());
    ArrayList<StringBuffer> lines = new ArrayList<>();
    Matcher matcher = compile("(?<=\n).*").matcher(text);
    while (matcher.find()) { 
        String line_str = matcher.group();
        if (checkSentenceLine(line_str)){
            lines.add(new StringBuffer(line_str));
        }          
    }
    FilePersistence.save (lines, doc_path + ".lin");  
    FilePersistence.save (lines.toString(), doc_path + "_extracoes.txt");
}

Corpus

public Document createDocument(String file_path) {
    File file = new File(file_path);
    if (file.isFile()) {
        return new Document(file);
    } else {
        Message.displayError("file path is not OK");
        return null;
    }
}

FilePersistence

public static void save (Object object_root, String file_path){
    if (object_root == null) return;
    // try-with-resources closes the stream even if writeObject throws
    try (ObjectOutputStream output = new ObjectOutputStream(new FileOutputStream(file_path))) {
        output.writeObject(object_root);
    } catch (Exception exception){
        System.out.println("Fail to save file: " + file_path + " --- " + exception);
    }
}

public static Object load (String file_path){
    // try-with-resources closes the stream, which the original never did
    try (ObjectInputStream input = new ObjectInputStream(new FileInputStream(file_path))) {
        return input.readObject();
    } catch (Exception exception){
        System.out.println("Fail to load file: " + file_path + " --- " + exception);
        return null;
    }
}

The problem is that the document contains right single quotation marks (U+2019) used as apostrophes. When I load the saved object and print it, NetBeans shows odd squares instead of the apostrophes, and opening the file in Notepad shows Â'. This prevents me from handling the extracted sentences properly, or at least from displaying them properly. At first I thought it was an encoding incompatibility.

Then I tried changing the encoding in the project properties to CP1252, but that only turned the blank squares into question marks, and Notepad still shows Â'.

I also tried using

String line_str = matcher.group().replace("’","'")

and

String line_str = matcher.group().replace('\u2019', '\'')

but neither has any effect.

Update:

if (checkSentenceLine(line_str)){
    System.out.println(line_str);
    lines.add(new StringBuffer(line_str));
}

This runs before saving to the binary file, and the single quotes are already garbled: they show as blank squares with UTF-8 and as ? with CP1252. That makes me think the problem occurs when reading from the .txt file.

The weird thing is that if I do this:

System.out.println('\u2019');

it shows a perfect right single quote. The problem only appears when reading from a .txt file, which makes me think the method I'm using to read the file is at fault. It also happens with bullet point symbols.

Maybe the problem is in converting the StringBuffer to a String? If so, how can I prevent it?
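
One way to test that hypothesis is to round-trip a string through a StringBuffer and print its code points instead of its glyphs. The following is only a minimal sketch; the class name CodePointDump and the method dump are hypothetical and not part of the project above.

```java
final class CodePointDump {
    // Render each code point as U+XXXX so substituted or invisible
    // characters become obvious regardless of the console font or encoding.
    static String dump(CharSequence text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "it\u2019s";
        // Same conversion the save method performs: String -> StringBuffer -> String
        String roundTripped = new StringBuffer(original).toString();

        System.out.println(dump(original));                 // U+0069 U+0074 U+2019 U+0073
        System.out.println(dump(roundTripped));             // U+0069 U+0074 U+2019 U+0073
        System.out.println(original.equals(roundTripped));  // true
    }
}
```

Calling dump(line_str) inside the loop would show what was actually read. If U+FFFD (the replacement character) or a sequence such as U+00E2 U+20AC U+2122 appears where U+2019 is expected, the text was already damaged when the file was decoded, not by the StringBuffer conversion.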

  • ANSI is a wrong name for a codepage. As http://stackoverflow.com/a/16084124/3897333 says, it's CP1252 – Paul Stelian Jul 08 '16 at 19:03
  • Serialized objects (that is, objects written with ObjectOutputStream) are binary data. They are not meant to be viewed in a text editor like Notepad. You haven’t shown your “print on screen” code, but if you’re just using System.out.println, that single quote character should show up in a NetBeans output area on all modern operating systems. – VGR Jul 08 '16 at 21:42
  • @vgr I updated the description – Mário Garcia Jul 09 '16 at 20:06
  • The fact that `println('\u2019')` works is useful information. That means the problem is almost certainly in your `Document` class, which probably is using the wrong Charset to read the file. Note that failing to specify any Charset will cause most Java APIs to use the underlying system’s default Charset, which is usually not the desired behavior. – VGR Jul 10 '16 at 02:24
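
As a sketch of the point in the last comment: the Document class is not shown in the question, so the code below does not reproduce it. It only illustrates reading a file with an explicitly chosen Charset, and what happens when the chosen Charset does not match the file. The class name CharsetReadSketch and the method read are hypothetical, and the assumption that the .txt file is UTF-8 is exactly that, an assumption. If the file was saved as CP1252, the matching charset would be windows-1252 instead.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CharsetReadSketch {
    // Decode the whole file with an explicitly chosen charset
    // instead of relying on the platform default.
    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real document: a temp file written as UTF-8 (assumption).
        Path file = Files.createTempFile("charset-sketch", ".txt");
        Files.write(file, "it\u2019s".getBytes(StandardCharsets.UTF_8));

        String right = read(file, StandardCharsets.UTF_8);
        String wrong = read(file, Charset.forName("windows-1252"));

        System.out.println(right.equals("it\u2019s")); // true
        System.out.println(wrong.equals(right));       // false

        Files.delete(file);
    }
}
```

With the mismatched charset, the three UTF-8 bytes of U+2019 (E2 80 99) are decoded as three separate CP1252 characters, so no later replace('\u2019', ...) call can find the quote.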
