3

I have a file whose contents I read into a string. I would like to extract only the values that follow

start="n= and end="n=

 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "SMIL10.dtd">
 <head>
 </head>
     <body>
            <audio start="n=10.815s" end="n=19.914s"/>
 </body>
</xml>

I tried the following:

   String startTime = readString.replaceAll(".*start=\"n=|\\s.*", "").trim();
   String endTime = readString.replaceAll(".*end=\"n=|\\s.*", "").trim();
   Log.e("Start Time is :" , startTime);
   Log.e("endTime Time is :" , endTime);

This mostly works: I get the start time and the end time, but the result also includes the <?xml tag.

How do I fix this?

MByD
Adarsh H S

4 Answers

3

I would rather use an XML parser to read this. Regular expressions aren't suited to parsing XML, HTML and the like; you'll find numerous questions on SO about this.

For Java, DOM and SAX are possibilities, but JDOM might make an easier starting point.
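As a rough illustration of the DOM route, here is a minimal sketch using the JDK's built-in javax.xml.parsers API. It assumes the input is well-formed XML with a single root element (a hypothetical <smil> root is used here); the sample in the question ends with </xml> and has no single root, so a parser would reject it as posted. The class name SmilAudioTimes and the helpers readTimes and stripPrefix are made up for this example, and the DOCTYPE line is left out so the parser does not go looking for SMIL10.dtd.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SmilAudioTimes {

    // Hypothetical helper: returns {start, end} of the first <audio> element,
    // or null if the document has no <audio> element.
    static String[] readTimes(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        NodeList audios = doc.getElementsByTagName("audio");
        if (audios.getLength() == 0) {
            return null;
        }
        Element audio = (Element) audios.item(0);
        return new String[] {
            stripPrefix(audio.getAttribute("start")),
            stripPrefix(audio.getAttribute("end"))
        };
    }

    // Drops the leading "n=" from an attribute value such as "n=10.815s".
    static String stripPrefix(String value) {
        return value.startsWith("n=") ? value.substring(2) : value;
    }

    public static void main(String[] args) throws Exception {
        // Assumed well-formed variant of the question's sample (no DOCTYPE, <smil> root).
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<smil><head></head><body>"
                + "<audio start=\"n=10.815s\" end=\"n=19.914s\"/>"
                + "</body></smil>";

        String[] times = readTimes(xml);
        System.out.println("start = " + times[0]);
        System.out.println("end   = " + times[1]);
    }
}
```

Because the parser hands back attribute values directly, there is no risk of the <?xml declaration leaking into the result.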

Community
Brian Agnew
2

The Java solution below works for any data that contains a string of the form

<audio start="n=........" end="n=......." />

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        // The document from the question, plus a second <audio> tag nested in another element.
        String inputData1 = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<!DOCTYPE smil PUBLIC \"-//W3C//DTD SMIL 1.0//EN\" \"SMIL10.dtd\">"
                + "<head>"
                + "</head>"
                + "<body>"
                + "<audio start=\"n=10.815s\" end=\"n=19.914s\"/>"
                + "<sometag> <audio start=\"n=10.815s\" end=\"n=20.914s\"/> </sometag>"
                + "</body>"
                + "</xml>";

        // Arbitrary text that happens to contain an <audio> tag.
        String inputData2 = "some data goes here with or without tags; <audio start=\"n=10.815s\" end=\"n=20.914s\"/>; askjdhfla ";

        // Group 1 captures the start value, group 2 the end value (both without the "n=" prefix).
        Pattern pattern = Pattern.compile("<audio[^>]*start\\s*=\\s*\"n\\s*=\\s*([^\"]*)\"[^>]*end\\s*=\\s*\"n\\s*=\\s*([^\"]*)\"[^>]*>");

        // Pass inputData2 instead of inputData1 to get the second output shown below.
        Matcher matcher = pattern.matcher(inputData1);

        // find() steps through every <audio> tag in the input.
        while (matcher.find()) {
            System.out.println("start=\"n=" + matcher.group(1) + ", & end=\"n=" + matcher.group(2));
        }
    }
}

Output For InputData1:
start="n=10.815s, & end="n=19.914s
start="n=10.815s, & end="n=20.914s


Output For InputData2:
start="n=10.815s, & end="n=20.914s
Santhosh Gutta
1

I agree with the previous answers. But if your file is always small, just a few lines, you may use a regexp. In that case try this pattern: (\n|\r|.)*end\s*=\s*\"n=(.*)\"(\n|\r|.)*

UPD: Group #2 will give you exactly what you want.
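For what it's worth, here is a small sketch of how that pattern could be applied in Java with Matcher.matches() and group(2). The class name EndTimeGroup and the helper endTime are made up for the example. Note that the greedy (.*) only yields the right value because end is the last quoted attribute on its line; swapping in start without also changing (.*) to something like [^\"]* would capture too much.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndTimeGroup {

    // The pattern from the answer, written as a Java string literal.
    private static final Pattern END_PATTERN =
            Pattern.compile("(\\n|\\r|.)*end\\s*=\\s*\"n=(.*)\"(\\n|\\r|.)*");

    // Hypothetical helper: returns the end value, or null when the text does not match.
    static String endTime(String text) {
        Matcher matcher = END_PATTERN.matcher(text);
        // matches() requires the whole input to fit the pattern; group 2 is the end value.
        return matcher.matches() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        String text = "<body>\n"
                + "    <audio start=\"n=10.815s\" end=\"n=19.914s\"/>\n"
                + "</body>\n";
        System.out.println(endTime(text));
    }
}
```

Returning null on a failed match avoids the situation from the question, where unrelated text ends up in the result.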

Andremoniy
1

It is always best to parse XML/HTML with a parser, not a regex. That said, for your problem you could try the following:

String s = "foo\n <audio start=\"n=10.815s\" end=\"n=19.914s\"/>bar\n";
String re = "(?s).*?(?<=start=\"n=)([^\"]*).*";
String startTime=s.replaceAll(re, "$1");

The example above assigns 10.815s to startTime. To get endTime, replace start with end in the regex.

A short explanation of the regex:

(?s) is the DOTALL flag, which means . matches newlines as well
(?<=start=\"n=)([^\"]*) this is a lookbehind.
                        It captures the text that follows start="n=
                        up to the next " (double quote); in this case, 10.815s

Hope it helps.
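Putting the start and end variants side by side, a minimal sketch might look like the following. It uses the same lookbehind idea but with Matcher.find() instead of replaceAll, so a missing attribute gives null rather than the unchanged input. The class name LookbehindTimes and the helper valueAfter are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindTimes {

    // Hypothetical helper: returns the text that follows name="n= up to the
    // next double quote, or null if that prefix does not occur in the text.
    static String valueAfter(String text, String name) {
        Pattern pattern = Pattern.compile("(?<=" + Pattern.quote(name) + "=\"n=)[^\"]*");
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String s = "foo\n <audio start=\"n=10.815s\" end=\"n=19.914s\"/>bar\n";

        // The replaceAll form from the answer, for both attributes.
        String startTime = s.replaceAll("(?s).*?(?<=start=\"n=)([^\"]*).*", "$1");
        String endTime = s.replaceAll("(?s).*?(?<=end=\"n=)([^\"]*).*", "$1");
        System.out.println(startTime + " / " + endTime);

        // The find() form, which reports a missing attribute as null.
        System.out.println(valueAfter(s, "start") + " / " + valueAfter(s, "end"));
        System.out.println(valueAfter(s, "dur"));
    }
}
```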

Kent