import java.io.*;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagText {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Initializing the tagger
        MaxentTagger tagger = new MaxentTagger("taggers/english-left3words-distsim.tagger");
        List<String> lines = new ArrayList<>();
        lines = new ReadCSV().readColumn("Tt2.csv", 4);
        for (String line : lines) {
            String tagged = tagger.tagString(line);
            System.out.println(tagged);
        }
    }
}

I'm trying to parse a CSV file that contains a character (binary 10010111, —) which I want the text parser to ignore. How would I do that?

user207421
  • 1
    `10010111b` is `0x97`, decimal `151`: the "extended" ASCII (windows-1252) code for an _em dash_. In Unicode, which Java uses, 0x97 falls in the [C1 control](https://en.wikipedia.org/wiki/C0_and_C1_control_codes) [char range](http://stackoverflow.com/q/18410167/17300), and the proper Unicode character is U+2014 (—). If you're not removing a plain dash you needn't remove an em dash, but you have to read the file in with the proper encoding (probably windows-1252, since ISO-8859-1 maps 0x97 to the C1 control U+0097) or translate it after reading it (0x97 -> 0x2014). I have a method that translates the C0 + C1 ranges to proper Unicode. See http://stackoverflow.com/questions/631406 – Stephen P Oct 14 '15 at 20:39
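
To make the encoding point concrete, here is a minimal sketch of how the single byte 0x97 decodes under three charsets, and how text already misread as ISO-8859-1 can be repaired afterwards. The class and method names (`DecodeDemo`, `decode`, `repairLatin1`, `codePoints`) are made up for illustration, and the assumption that the CSV was written as windows-1252 is only a guess based on the byte value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Assumption: the CSV was produced on Windows, so 0x97 is a windows-1252 em dash.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Repair text that was already decoded as ISO-8859-1: re-encode it to the
    // original bytes, then decode those bytes as windows-1252.
    // Only valid if the string really came from an ISO-8859-1 decode.
    static String repairLatin1(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), CP1252);
    }

    // Render each char as U+XXXX so invisible control characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] raw = {'a', (byte) 0x97, 'b'};
        // UTF-8: 0x97 is a stray continuation byte, replaced by U+FFFD.
        System.out.println(codePoints(decode(raw, StandardCharsets.UTF_8)));
        // ISO-8859-1: 0x97 becomes the C1 control character U+0097.
        System.out.println(codePoints(decode(raw, StandardCharsets.ISO_8859_1)));
        // windows-1252: 0x97 becomes the em dash U+2014.
        System.out.println(codePoints(decode(raw, CP1252)));
        // Translating after the fact gives the same result as decoding correctly.
        System.out.println(codePoints(repairLatin1(decode(raw, StandardCharsets.ISO_8859_1))));
    }
}
```

If the file really is windows-1252, passing that charset to whatever reader the CSV helper uses would avoid the stray character in the first place.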

1 Answer


So I guess you want to remove all special characters?

I think it was something like: replaceAll("[^\\w\\s]", ""); (the backslashes have to be doubled inside a Java string literal).
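
A rough sketch of what that regex does, next to a narrower alternative that only drops the C1 control range, the em dash and the replacement character. The class and method names (`StripDemo`, `stripBroad`, `stripNarrow`) are invented for this example. Note that the broad pattern also removes ordinary punctuation and, because `\w` is ASCII-only by default in Java, any non-ASCII letters, which may not be what you want before tagging.

```java
class StripDemo {
    // Broad: removes everything that is not an ASCII word character or whitespace.
    static String stripBroad(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    // Narrow (an assumption about what should go): removes only C1 controls
    // (U+0080 to U+009F), the em dash U+2014 and the replacement char U+FFFD.
    static String stripNarrow(String s) {
        return s.replaceAll("[\\u0080-\\u009F\\u2014\\uFFFD]", "");
    }

    public static void main(String[] args) {
        String sample = "a\u2014b, c!";
        System.out.println(stripBroad(sample));   // punctuation is removed as well
        System.out.println(stripNarrow(sample));  // only the em dash is removed
    }
}
```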

Edit: Full Code

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagText {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Initializing the tagger
        MaxentTagger tagger = new MaxentTagger("taggers/english-left3words-distsim.tagger");
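        // Read column 4 of the CSV (ReadCSV is the asker's own helper class)
        List<String> lines = new ReadCSV().readColumn("Tt2.csv", 4);
        for (String line : lines) {
            // Drop U+FFFD, the replacement character a decoder emits for bytes
            // it cannot map (here 0x97, assuming the file was decoded as UTF-8)
            String tagged = tagger.tagString(line.replace("\uFFFD", ""));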
            System.out.println(tagged);
        }
    }
}
Friwi