1

I have a long list of words like the one below, and I want to split each line at the first space into a word and its meaning. I am using Apache POI because I have to read the data from a .docx file.

    abash  humiliate, embarrass
    abdicate  relinquish power or position
    aberrant  abnormal
    abet  aid, encourage (typically of crime)
    abeyance  postponement
    aboriginal  indigenous 
    abridge  shorten
    abstemious  moderate
...

So what regex would suit my purpose so that I can display it like:

word :abash
meaning : humiliate, embarrass
...

My code is:

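import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordFileReader {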

    /**
     * @param args
     */
    public static void main(String[] args) {
         try {
                FileInputStream fis = new FileInputStream("E:\\important.docx");
                org.apache.poi.xwpf.extractor.XWPFWordExtractor oleTextExtractor = new XWPFWordExtractor(new XWPFDocument(fis));
                System.out.print(oleTextExtractor.getText());            
            } catch (Exception e) {
                    e.printStackTrace();
            }

    }

}

--Edit-- Based on a suggested answer I am using this

public static void main(String[] args) {
         try {
                FileInputStream fis = new FileInputStream("E:\\Words.docx");
                org.apache.poi.xwpf.extractor.XWPFWordExtractor oleTextExtractor = new XWPFWordExtractor(new XWPFDocument(fis));
                //System.out.print(oleTextExtractor.getText());

                Scanner sc = new Scanner(oleTextExtractor.getText());            
                while(sc.hasNextLine()) {
                 String line = sc.nextLine();
                 int i = line.indexOf(' ');
                 String word = line.substring(0, i);
                 String meaning = line.substring(i).trim();

                 System.out.println("word "+word);
                 System.out.println("meaning "+meaning);
                }

            } catch (Exception e) {
                    e.printStackTrace();
            }

    }

But I get:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(Unknown Source)
    at WordFileReader.main(WordFileReader.java:25)
Prateek

4 Answers

3

I would use java.util.Scanner to extract lines from text

Scanner sc = new Scanner(oleTextExtractor.getText());            
while(sc.hasNextLine()) {
    String line = sc.nextLine();
    ...

then I would split the line into word and meaning

 int i = line.indexOf(' ', 2);  // start from position 2 to skip a leading article "a"
 String word = line.substring(0, i);
 String meaning = line.substring(i).trim();

or

 String[] parts = line.split("(?<!^a)\\s+", 2);
 String word = parts[0];
 String meaning = parts[1];
Evgeniy Dorofeev
  • java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(Unknown Source) at WordFileReader.main(WordFileReader.java:25) – Prateek Jun 10 '13 at 08:48
  • This means there is a line with no ' ' in it, possibly an empty one – Evgeniy Dorofeev Jun 10 '13 at 08:51
  • How can I skip such a line and move on? – Prateek Jun 10 '13 at 08:54
  • I added an `if(line.contains(" "))` condition before the substring calls and it works – Prateek Jun 10 '13 at 08:56
  • I just missed one more thing: a few entries start with the article `a` but should be treated as one word. For example, the word is `a cappella` and the meaning is `without accompaniment`, but it is now displayed as `a` - `cappella without accompaniment` – Prateek Jun 10 '13 at 09:02
  • I see, try `int i = line.indexOf(' ', 2);` – Evgeniy Dorofeev Jun 10 '13 at 09:08
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/31520/discussion-between-prateek-and-evgeniy-dorofeev) – Prateek Jun 10 '13 at 10:56
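
The two fixes discussed in these comments (skipping lines that contain no space, and keeping a leading article such as `a cappella` together with its word) can be combined in one small helper. This is only a minimal sketch, not code from the thread: the class name `EntrySplitter` and the method `splitEntry` are hypothetical, and it assumes the only article that needs special handling is a lone leading `a`. It works on plain strings, so the text returned by `oleTextExtractor.getText()` could be fed to it line by line.

```java
public class EntrySplitter {

    // Hypothetical helper: returns {word, meaning}, or null when the line
    // has no separator (blank line, or a word without a meaning).
    static String[] splitEntry(String line) {
        String trimmed = line.trim();
        // Assumption: a leading "a " is an article that belongs to the word,
        // so the search for the separator starts after it.
        int from = trimmed.startsWith("a ") ? 2 : 0;
        int i = trimmed.indexOf(' ', from);
        if (i < 0) {
            return null; // avoids substring(0, -1) and the resulting exception
        }
        return new String[] { trimmed.substring(0, i), trimmed.substring(i).trim() };
    }

    public static void main(String[] args) {
        String text = "abash  humiliate, embarrass\n\na cappella  without accompaniment\n";
        for (String line : text.split("\n")) {
            String[] entry = splitEntry(line);
            if (entry == null) {
                continue; // skip lines that cannot be split
            }
            System.out.println("word : " + entry[0]);
            System.out.println("meaning : " + entry[1]);
        }
    }
}
```

Returning `null` for unsplittable lines keeps the decision about what to do with them (skip, log, collect) in the caller rather than in the helper.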
1

Use java.lang.String.split(String regex, int limit):

String[] parts = line.split("\\s+", 2);  // limit 2: split only at the first run of whitespace
String word = parts[0];
String meaning = parts[1];
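
The `limit` argument controls how many pieces `split` returns: the pattern is applied at most `limit - 1` times, so a limit of 1 never splits, while a limit of 2 splits once, at the first match. A small sketch of the difference follows; the class name `SplitLimitDemo` and the helper `splitOnce` are made up for illustration.

```java
public class SplitLimitDemo {

    // Hypothetical helper: split a line at the first run of whitespace only.
    static String[] splitOnce(String line) {
        return line.split("\\s+", 2);
    }

    public static void main(String[] args) {
        String line = "abet  aid, encourage (typically of crime)";

        // limit 1: the pattern is applied zero times, so the whole line comes back.
        String[] unsplit = line.split("\\s+", 1);
        System.out.println(unsplit.length); // 1

        // limit 2: one split at the first whitespace run, the rest stays intact.
        String[] parts = splitOnce(line);
        System.out.println(parts[0]); // abet
        System.out.println(parts[1]); // aid, encourage (typically of crime)

        // A line without whitespace yields a single element, so check the
        // length before reading parts[1].
        System.out.println(splitOnce("abridge").length); // 1
    }
}
```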
0

You can use substring as follows:

int index = line.indexOf(" ");
if (index > 0) {  // skip lines that contain no space
    System.out.println("word : " + line.substring(0, index) + "\n Meaning : " + line.substring(index + 1));
}

Harish Kumar
0

The code below works fine for me. I used a BufferedReader to read the text from a file.

// Needs java.io.BufferedReader, java.io.FileReader, java.io.IOException,
// java.util.logging.Level and java.util.logging.Logger.
// try-with-resources closes the reader even if reading fails.
try (BufferedReader br = new BufferedReader(new FileReader("C:\\test.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        // Split at the first space only; limit 2 keeps the rest of the line intact.
        String[] parts = line.split(" ", 2);
        if (parts.length < 2) {
            continue;  // blank line or no space: nothing to split
        }
        String word = parts[0];
        String meaning = parts[1].trim();

        System.out.println("word:" + word);
        System.out.println("meaning:" + meaning);
    }
} catch (IOException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
ridoy