
I'm trying to parse an N3 DBpedia dump file using SemWeb. Here is my code:

Imports SemWeb
…
Dim store As New MemoryStore
Dim sr As New System.IO.StreamReader(chunkFile)
store.Import(New N3Reader(sr))

When I'm parsing the chunk file (which includes http://www.georss.org/georss/point predicates), I get this exception:

System.OverflowException: Value was either too large or too small for an Int32.
   at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
   at System.Xml.XmlConvert.ToInt32(String s)
   at SemWeb.Literal.ParseValue()
   at SemWeb.RdfReader.ValidateLiteral(Literal literal)
   at SemWeb.N3Reader.ReadToken(MyReader source, ParseContext context)
   at SemWeb.N3Reader.ReadResource2(ParseContext context, Boolean allowDirective, Boolean& reverse, Boolean& forgetBNode)
   at SemWeb.N3Reader.ReadResource(ParseContext context, Boolean allowDirective, Boolean& reverse, Boolean& forgetBNode)
   at SemWeb.N3Reader.ReadObject(Resource subject, Entity predicate, ParseContext context, Boolean reverse)
   at SemWeb.N3Reader.ReadPredicate(Resource subject, ParseContext context)
   at SemWeb.N3Reader.ReadPredicates(Resource subject, ParseContext context)
   at SemWeb.N3Reader.ReadStatement(ParseContext context)
   at SemWeb.N3Reader.Select(StatementSink store)
   at SemWeb.MemoryStore.StoreImpl.Import(StatementSource source)
   at SemWeb.Store.Import(StatementSource source)
   at ConsoleApplication2.Module1.SaveToDB(String chunkFilePath) in D:\ConsoleApplication2\ConsoleApplication2\Module1.vb:line 31

I downloaded the file from DBpedia, and these dumps have been processed many times by other parsers, so I would not expect the data itself to be malformed. Unfortunately, SemWeb does not report which line of the input caused the exception, so I can't locate the offending line(s). Is there any way to solve this?

Joshua Taylor
Amir Pournasserian
  • Have you tried loading the file by passing the file name to N3Reader, without using a StreamReader? Or perhaps you could try dotNetRDF, as explained in http://stackoverflow.com/questions/3083625/semweb-library-rdf-parser-for-c-sharp – Ben Companjen Mar 27 '13 at 14:17
  • Yes, I also tried dotNetRDF. There are some URLs in the DBpedia dump file that are not in the correct format, and they cause the exception. In the end I split the file into small chunk files and was able to find the exception. I've also implemented my own parser (using dotNetRDF) to skip the invalid triples. – Amir Pournasserian Mar 29 '13 at 05:21
  • @JoshuaTaylor I think there is a format that neither SemWeb nor dotNetRDF accepts. There are a few such lines in the GIS data section of the dump file (not in all dump files). Try parsing the file with the sample code above. I just want to know whether I made a mistake in my code. I'm sure that the dump file's format is correct (as you mentioned). – Amir Pournasserian Jun 29 '13 at 03:30
  • I haven't used dotNetRDF, so I can't try your code, unfortunately, but it _looks_ OK to me (for what that's worth). In your first comment you said that there are some URLs that are not in the correct format, and that you split the file to find the problematic ones. Can you _show_ what the problematic ones were so that other SO users will benefit from this knowledge? Those datasets are a bit too large to just download and experiment with conveniently. From the stacktrace, it _looks_ like the parser was trying to get something into a 32-bit integer when it wouldn't fit. It would be useful… – Joshua Taylor Jun 29 '13 at 13:15
  • …to know whether the parser trying to fit it into a 32-bit integer is a result of the _data_ saying “this is a 32-bit integer” and being incorrect, or the parser _assuming_ “this should fit into a 32-bit integer.” – Joshua Taylor Jun 29 '13 at 13:16
  • @AmirPournasserian Could you describe what the problematic triple was and post and accept an answer about it? If there is malformed data in the DBpedia dumps, you're probably not the only one who will encounter it, so it could be beneficial to others. In addition, we'll have one less question with the rdf tag with no answers. :) – Joshua Taylor Sep 11 '13 at 12:54
  • @JoshuaTaylor I've implemented my own NTriplesParser to handle the exception. Try using dotNetRDF to read the "Ontology Infobox Properties" dump file. – Amir Pournasserian Sep 17 '13 at 04:31
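
The stack trace suggests the parser was converting a typed literal to a 32-bit integer, so one way to find the offending lines without splitting the file by hand is to scan the dump before parsing it. The following is a minimal sketch in Python, independent of SemWeb and dotNetRDF. It assumes the chunk is line-oriented (one triple per line, as in N-Triples) and that literals typed as `xsd:int` are the ones being parsed as Int32; neither assumption is confirmed by the post, and the function name `find_int32_overflows` is hypothetical.

```python
import re

# Assumption: these are the datatype IRIs a parser might map to a 32-bit
# integer. The exact set SemWeb uses is not confirmed by the post.
INT32_DATATYPES = {
    "http://www.w3.org/2001/XMLSchema#int",
}
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

# Matches a typed literal ending a triple: "lexical"^^<datatype> .
TYPED_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"\^\^<([^>]*)>\s*\.\s*$')
# Plain decimal integer with an optional sign.
INTEGER = re.compile(r'[+-]?[0-9]+')


def find_int32_overflows(lines):
    """Yield (line_number, line) for suspect 32-bit integer literals.

    A literal is reported when its datatype is in INT32_DATATYPES and its
    lexical form is either not a plain integer or outside the Int32 range.
    """
    for number, line in enumerate(lines, start=1):
        match = TYPED_LITERAL.search(line)
        if match is None:
            continue  # no typed literal in object position on this line
        lexical, datatype = match.groups()
        if datatype not in INT32_DATATYPES:
            continue
        is_integer = INTEGER.fullmatch(lexical) is not None
        if not is_integer or not INT32_MIN <= int(lexical) <= INT32_MAX:
            yield number, line.rstrip("\n")
```

Passing an open file, for example `find_int32_overflows(open(path, encoding="utf-8"))`, streams the dump line by line, so the whole file never has to fit in memory. If other datatypes turn out to be involved, they can be added to `INT32_DATATYPES`.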

0 Answers