I am trying to write a regular expression to match an XML document. The reason I am not using an XML parser right away is that the file might contain multiple XML documents (well formed or not), so I would like to remove the ones that are not well formed before parsing.
XML structure:
<company>
.....
<Employees>
.......
</Employees>
</company>
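Code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String xmlString = "...";
final List<String> data = new ArrayList<String>();
try
{
    // Compile both patterns once, outside the loop. DOTALL lets "." match line breaks.
    final Pattern pattern = Pattern.compile("<company>(.+?)</company>", Pattern.DOTALL);
    final Pattern pattern1 = Pattern.compile("<Employees>(.+?)</Employees>", Pattern.DOTALL); // "+?" is the lazy quantifier
    final Matcher matcher = pattern.matcher(xmlString);
    while (matcher.find())
    {
        // Look for the Employees block inside the body of each company match.
        final Matcher matcher1 = pattern1.matcher(matcher.group(1));
        if (matcher1.find())
        {
            data.add(matcher1.group(1));
        }
    }
}
catch (final Exception e)
{
    // Report the failure instead of swallowing it silently.
    e.printStackTrace();
}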
This works fine if the XML string contains a single XML document, whether well formed or not. It does not work when a document that is not well formed is followed by a well-formed one, for example when the first <company> is never closed:
<company>
<Employees>
</Employees>
<company>
.....
<Employees>
.......
</Employees>
</company>
In this scenario the match starts at the first <company> and runs to the only </company>, so it returns the whole string rather than just the well-formed XML document.
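To make the problem reproducible, here is a minimal self-contained sketch. The class name CompanyRegexDemo, the helper bodies, and the TEMPERED pattern are illustrative names I made up, not part of my real code. TEMPERED is one variant I am considering: a negative lookahead stops the match from running across a second <company>. It assumes <company> elements are never legitimately nested, and it only narrows the match; it does not prove that what it returns is well formed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompanyRegexDemo {
    // Original pattern: the lazy ".+?" can still run across a second "<company>".
    static final Pattern LAZY =
            Pattern.compile("<company>(.+?)</company>", Pattern.DOTALL);

    // Hypothetical variant: every consumed character must not begin another "<company>".
    static final Pattern TEMPERED =
            Pattern.compile("<company>((?:(?!<company>).)+?)</company>", Pattern.DOTALL);

    // Collects group(1) of every match of p in input.
    static List<String> bodies(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // An unclosed <company> followed by a well-formed one.
        String input = "<company><Employees>a</Employees>"
                + "<company><Employees>b</Employees></company>";

        // prints [<Employees>a</Employees><company><Employees>b</Employees>]
        System.out.println(bodies(LAZY, input));

        // prints [<Employees>b</Employees>]
        System.out.println(bodies(TEMPERED, input));
    }
}
```

With the lookahead, the attempt that starts at the first <company> fails when it reaches the second opener, so the engine moves on and matches only the closed block. I am not sure this is robust enough for real input (attributes, comments, CDATA), which is part of what I am asking.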
Please help. Thanks!