I have a dataset of product reviews, and I want to extract the text between two tags in that file and print it. How can I do this? The data file has the following format:

<review> id 
<reviewer></reviewer> 
<start word></end word> 
</review>

My code is:

    File file = new File("D://Data/Dataset/unlabeled.review");
    FileInputStream fis = new FileInputStream(file);
    byte[] bytes = new byte[(int) file.length()];
    fis.read(bytes);
    fis.close();
    String text = new String(bytes, "UTF-8");
    System.out.println(text.substring(text.indexOf("<start word>"), text.lastIndexOf("</end word>")));
Vishal Kawade
  • With some code... What did you try? – Mar 01 '16 at 12:21
  • see http://stackoverflow.com/questions/34129040/simple-way-to-extract-data-from-xml-with-java for example –  Mar 01 '16 at 12:26

1 Answer

Your extraction code is this:

    text.substring(text.indexOf("<review_text>"), 
                   text.lastIndexOf("</review_text>"));

There are three problems with this code:

  1. The indexOf and lastIndexOf methods return the offset of the first character of an occurrence of the argument string. But you need to extract starting from the first character after "<review_text>", so the start index must be advanced by the length of that tag.

  2. If there are multiple "<review_text>" / "</review_text>" pairs, then your code doesn't extract the text between each pair. It returns everything from the first opening tag up to the last closing tag.

  3. If there is no "<review_text>" or no "</review_text>", then one or both of the index-of calls will return -1, and that will lead to an exception in the substring call.
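
One way to address all three problems is sketched below. This is only an illustration, not the only possible fix: the class name ReviewTextExtractor and the method extractAll are hypothetical, the tag names are passed in as parameters, and trimming the extracted text is an assumption about what you want to print. It uses only String.indexOf(String, int) and String.substring(int, int).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
final class ReviewTextExtractor {

    // Returns the text between every openTag/closeTag pair, in order of appearance.
    static List<String> extractAll(String text, String openTag, String closeTag) {
        List<String> results = new ArrayList<>();
        int searchFrom = 0;
        while (true) {
            int open = text.indexOf(openTag, searchFrom);
            if (open < 0) {
                break; // problem 3: no (further) opening tag, so stop instead of calling substring
            }
            // Problem 1: skip past the opening tag itself.
            int contentStart = open + openTag.length();
            // Look for the nearest closing tag after this opening tag, not the last one in the file.
            int close = text.indexOf(closeTag, contentStart);
            if (close < 0) {
                break; // problem 3: unmatched opening tag
            }
            // Assumption: surrounding whitespace is not wanted, so it is trimmed.
            results.add(text.substring(contentStart, close).trim());
            // Problem 2: continue searching after this pair.
            searchFrom = close + closeTag.length();
        }
        return results;
    }
}
```

With the text variable from the question, you could call ReviewTextExtractor.extractAll(text, "<review_text>", "</review_text>") and print each element of the returned list. If the file is well-formed XML, a real XML parser is usually the more robust choice.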

Stephen C