To begin with, the XML file is 2.84 GB, and neither a SAX nor a DOM parser seems to work: I have tried both, and each one crashes every time. So I chose to read the file with a BufferedReader and extract the data I want, treating the XML file as if it were plain text.

XML file (small part):

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
<author>Gerd Hoff</author>
<title>Ein Verfahren zur thematisch spezialisierten Suche im Web und seine Realisierung im Prototypen HomePageSearch</title>
<year>2002</year>

From that XML file I want to retrieve the data between the <year> tags. I also used Pattern and Matcher with a regex to find the information I want. My code so far:

public class Publications {
    public static void main(String[] args) throws IOException {
        File file = new File("dblp-2020-04-01.xml");
        FileInputStream fileStream = new FileInputStream(file);
        InputStreamReader input = new InputStreamReader(fileStream);
        BufferedReader reader = new BufferedReader(input);
        String line;
        String regex = "\\d+";


        // Reading line by line from the
        // file until a null is returned
        while ((line = reader.readLine()) != null) {
            final Pattern pattern = Pattern.compile("<year>(.+?)</year>", Pattern.DOTALL);
            final Matcher matcher = pattern.matcher("<year>"+regex+"</year>");
            matcher.find();
            System.out.println(matcher.group(1)); // Prints String I want to extract
            }
        }
}

After compiling, the results aren't what I expected. Instead of printing the exact year every time the parser finds the ... tag, the output is the following:

\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+

Any suggestions?

Johannes Kuhn
Michael P.
  • 1
    "_tried them and every time crashes_" – could you elaborate? – Slaw Apr 22 '20 at 10:48
  • 1
    SAX parsers are designed for handling outsize data and should not crash due to sheer file size. Have you run a SAX parser _without_ building data structures in the course of processing? – collapsar Apr 22 '20 at 10:51
  • You don't use the actual line to match the regexp; you always match against `\d+` for each line. – k5_ Apr 22 '20 at 10:52
  • Console output: "The parser has encountered more than "64000" entity expansions in this document; this is the limit imposed by the JDK." @Slaw – Michael P. Apr 22 '20 at 10:53
  • @k5_ What should I put instead of \d+? – Michael P. Apr 22 '20 at 10:56
  • Does this help at all? https://stackoverflow.com/q/21588619/6395627 – Slaw Apr 22 '20 at 10:57
  • Another source indicates a different system property: https://stackoverflow.com/q/20482331/6395627 – Slaw Apr 22 '20 at 11:02
  • @Slaw I'm going to be examined on that project on another computer, so I have to find a general solution. – Michael P. Apr 22 '20 at 11:02
  • Personally, I find setting a system property to _be_ a general solution. Depending on the life-cycle of your application you could probably even set the property in code via `System.setProperty(...)`. – Slaw Apr 22 '20 at 14:47
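
To illustrate the system-property suggestion from the comments, here is a minimal SAX sketch. It assumes that `jdk.xml.entityExpansionLimit` is the JAXP property behind the "64000 entity expansions" message and that a value of `0` lifts the limit; the class name `YearSaxSketch`, the method `forEachYear`, and the sample author are made up for illustration. It only reacts to `<year>` elements and builds no data structures, so memory use should stay flat regardless of file size.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class YearSaxSketch {

    // Streams through the document and hands the text of every <year> element to the sink.
    static void forEachYear(InputStream in, Consumer<String> sink) throws Exception {
        // Assumption: this JAXP property controls the entity expansion limit;
        // "0" is taken to mean "no limit". Set it before the parser is created.
        System.setProperty("jdk.xml.entityExpansionLimit", "0");

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inYear;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("year".equals(qName)) {
                    inYear = true;
                    text.setLength(0); // start collecting a fresh value
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inYear) {
                    text.append(ch, start, length); // may arrive in several chunks
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("year".equals(qName)) {
                    inYear = false;
                    sink.accept(text.toString().trim());
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up sample with one internal entity, standing in for the real file.
        String sample = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>"
                + "<dblp><phdthesis><author>J&uuml;rgen Example</author>"
                + "<year>2010</year></phdthesis></dblp>";
        forEachYear(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)),
                System.out::println);
    }
}
```

For the real file, the stream would come from something like `new FileInputStream("dblp-2020-04-01.xml")`, with the DTD file available where the parser can resolve it.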

2 Answers

2

Please don't try parsing XML using regular expressions. We get hundreds of questions on this forum from people trying to generate XML in peculiar formats because that's the only thing the receiving application can handle, and the reason the receiving application has such restrictions is that it's trying to do the XML parsing "by hand". You're storing up trouble for yourself, for the people you want to exchange data with, and for the people on StackOverflow that you will turn to for help when it all goes pear-shaped. XML standards exist for a reason, and work very well when everyone conforms to them.

The right approach in this case is a streaming XML approach, using SAX, StAX, or streaming XSLT 3.0, and you've abandoned those approaches for completely spurious reasons.
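
As a rough illustration of the streaming approach, here is a minimal StAX sketch that pulls out the text of every `<year>` element. The class name `YearStaxSketch` and the method `forEachYear` are hypothetical, and the sample input is a cut-down version of the data in the question without its DOCTYPE; for the real file, the DTD and the entity limit mentioned in the comments would still need to be dealt with, which this sketch does not attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class YearStaxSketch {

    // Pulls events one at a time; nothing but the current <year> text is kept in memory.
    static void forEachYear(InputStream in, Consumer<String> sink) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "year".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag
                    sink.accept(reader.getElementText().trim());
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample = "<dblp>"
                + "<phdthesis><author>Carmen Heine</author><year>2010</year></phdthesis>"
                + "<phdthesis><author>Gerd Hoff</author><year>2002</year></phdthesis>"
                + "</dblp>";
        forEachYear(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)),
                System.out::println);
    }
}
```

Passing a callback instead of returning a list is a deliberate choice here: with a file of several gigabytes, collecting every value before printing would defeat the purpose of streaming.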

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
0

Remark

Regexen are the wrong tool for extracting information from XML (or similarly structured formats), and the general approach is not recommended. For the right way to handle it, cf. Michael Kay's answer.

Answer

You provide the wrong argument in constructing the matcher. Instead of the expression in your code you need to provide the current line:

// ...
final Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // Prints String I want to extract
}
// ...

Note the extra conditional to check whether the current line does match at all.

Also note that the pattern you match against is the one passed to Pattern.compile, not the string passed to matcher(). Thus, to match only <year> tags that contain numerical values, the line has to be changed to

final Pattern pattern = Pattern.compile("<year>(" + regex + ")</year>", Pattern.DOTALL);
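
Putting both changes together, a quick-and-dirty version might look like the sketch below. The class name `YearRegexSketch` and the method `yearsIn` are made up for illustration, and the whole thing assumes that every `<year>…</year>` pair sits on a single line, which holds for the sample in the question but is not guaranteed by XML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearRegexSketch {

    // Compiled once, outside the loop; only digits are accepted between the tags.
    private static final Pattern YEAR = Pattern.compile("<year>(\\d+)</year>");

    // Returns every year found, in order of appearance.
    static List<String> yearsIn(BufferedReader reader) throws IOException {
        List<String> years = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher matcher = YEAR.matcher(line); // match against the current line
            while (matcher.find()) {              // a line may hold zero, one, or several tags
                years.add(matcher.group(1));
            }
        }
        return years;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<author>Carmen Heine</author>\n"
                + "<year>2010</year>\n"
                + "<school>Aarhus University</school>\n"
                + "<year>2002</year>\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            yearsIn(reader).forEach(System.out::println);
        }
    }
}
```

For the real file, the reader could be opened with `Files.newBufferedReader(Paths.get("dblp-2020-04-01.xml"), StandardCharsets.ISO_8859_1)` to match the encoding declared in the XML header, and the years printed directly rather than collected in a list.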
collapsar
  • 17,010
  • 4
  • 35
  • 61
  • It doesn't work perfectly. It handles the one test case that you've supplied. It's dead easy to construct a different test case for which it will fail. – Michael Kay Apr 22 '20 at 13:50
  • @MichaelKay You are perfectly right. Regexen are the wrong tool for the job, unless all that is needed is a one-time quick & dirty solution and the proper tools aren't available or are too much of a hassle to use. Only the OP knows about that. I'll add a note to the answer though. – collapsar Apr 22 '20 at 14:21