To begin with, the XML file is 2.84 GB, and neither a SAX nor a DOM parser seems to work: I have tried both, and every attempt crashes. So I chose to read the file with a BufferedReader and extract the data I want, treating the XML file as plain text.
XML file (small part):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
<author>Gerd Hoff</author>
<title>Ein Verfahren zur thematisch spezialisierten Suche im Web und seine Realisierung im Prototypen HomePageSearch</title>
<year>2002</year>
From that XML file I want to retrieve the data between the <year> tags. I also used Pattern and Matcher with a regex to find the information I want. My code so far:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Publications {
    public static void main(String[] args) throws IOException {
        File file = new File("dblp-2020-04-01.xml");
        FileInputStream fileStream = new FileInputStream(file);
        InputStreamReader input = new InputStreamReader(fileStream);
        BufferedReader reader = new BufferedReader(input);
        String line;
        String regex = "\\d+";
        // Read line by line from the file until null is returned
        while ((line = reader.readLine()) != null) {
            final Pattern pattern = Pattern.compile("<year>(.+?)</year>", Pattern.DOTALL);
            final Matcher matcher = pattern.matcher("<year>" + regex + "</year>");
            matcher.find();
            System.out.println(matcher.group(1)); // Prints the string I want to extract
        }
    }
}
After compiling and running it, the results aren't what I expected. Instead of printing the actual year every time the parser finds the <year>...</year> tag, it prints the following:
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
\d+
Any suggestions?
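For comparison, here is a minimal sketch of what the loop appears to be aiming for. In the code above, the matcher is given the literal string "<year>\d+</year>" rather than the line read from the file, so group(1) is always the text \d+. The sketch instead puts \d+ inside the pattern and matches against each line. It assumes every <year> element sits on a single line, as in the sample, and that the file is ISO-8859-1 as its XML declaration states. The names YearExtractor and extractYears are hypothetical, introduced only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearExtractor {
    // Compiled once, outside the read loop; \d+ is part of the pattern itself.
    private static final Pattern YEAR = Pattern.compile("<year>(\\d+)</year>");

    // Returns every year found in one line of input (an empty list if none).
    static List<String> extractYears(String line) {
        List<String> years = new ArrayList<>();
        Matcher matcher = YEAR.matcher(line); // match the line, not the regex text
        while (matcher.find()) {              // only read group(1) after a successful find()
            years.add(matcher.group(1));
        }
        return years;
    }

    public static void main(String[] args) throws IOException {
        // File name taken from the question; charset assumed from the XML declaration.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("dblp-2020-04-01.xml"), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String year : extractYears(line)) {
                    System.out.println(year);
                }
            }
        }
    }
}
```

Because find() is checked before group(1) is read, lines without a <year> element are skipped instead of throwing an IllegalStateException, and the try-with-resources block closes the reader when the loop ends.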