I have a file containing N-Quads (using the schema.org vocabulary) and I want to load it into a TDB RDF-store, using Apache Jena's command line tools. The command that I'm using is:

tdbloader --loc <rdf_store_location> <file_to_load>

During loading, I got this error:

[line: 769293, col: 154] Illegal unicode escape sequence value: \" (0x22)

I also ran the validation tool from Jena command line tools:

riot --validate <file_to_load>

and indeed, it reports at least 30 errors/warnings similar to this:

Bad IRI

The path contains a segment /../ not at the beginning of a relative reference, or it contains a /./ These should be removed

Is there a way to ignore or delete invalid N-Quads using command line tools (Jena's, or any others you know of)?

Otherwise, the only option would be to write a script that removes the invalid characters. But the file is huge (60 GB), and I suspect that approach would be very error-prone.

unor
myName
  • It is good to check before loading, because of bad or unsuitable data. N-Quads is line-based: skipping a triple means removing that line. Use the text-editing tools of your OS: on Linux, "sed", "perl", etc. will be able to find the lines in error and skip or fix them. – AndyS Jun 22 '17 at 22:26
  • Thank you, @AndyS! I will remove the bad lines then. – myName Jun 22 '17 at 23:00
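Since N-Quads is line-based, the line-removal approach from the comment can be sketched in Python as well as in sed or perl. This is only a rough sketch under stated assumptions: it assumes the validator output has been saved to a text file, and that every problem is reported with a "[line: N, col: M]" prefix like the error quoted in the question. The function names (`parse_bad_lines`, `drop_lines`, `clean_file`) are hypothetical helpers, not part of Jena.

```python
import re

# Matches the "[line: N, col: M]" prefix seen in the error quoted in the question.
# Assumption: all errors/warnings in the saved log carry this prefix.
LINE_REF = re.compile(r"\[line:\s*(\d+),\s*col:\s*\d+\s*\]")


def parse_bad_lines(log_text):
    """Return the set of 1-based line numbers mentioned in the validator log."""
    return {int(match.group(1)) for match in LINE_REF.finditer(log_text)}


def drop_lines(src, dst, bad_lines):
    """Copy src to dst line by line, skipping the 1-based numbers in bad_lines.

    Both arguments are binary file-like objects, so the surviving bytes are
    copied unchanged and the input is never held in memory as a whole.
    Returns the number of lines skipped.
    """
    skipped = 0
    for number, line in enumerate(src, start=1):
        if number in bad_lines:
            skipped += 1
            continue
        dst.write(line)
    return skipped


def clean_file(log_path, in_path, out_path):
    """Hypothetical wrapper: read the saved log, then filter in_path into out_path."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        bad_lines = parse_bad_lines(log.read())
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        return drop_lines(src, dst, bad_lines)
```

A call such as `clean_file("validate.log", "data.nq", "data.clean.nq")` (file names are placeholders) would write a copy without the reported lines. Note that this drops every reported line, warnings included; if the "Bad IRI" warnings are acceptable for your data, filter the log first. Running the validator again on the cleaned file is a sensible check, because a parser may stop reporting after a fatal error.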

0 Answers