I need to get the structure or schema for a huge XML file (about 60 GB). What's the best way to get all the attributes?

TobiasKnudsen

3 Answers

Try reading the first few lines and check whether they contain a schema declaration. You could do that by simply matching on the string "<xs:schema ", as in this example:

<?xml version="1.0"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://www.w3schools.com"
xmlns="https://www.w3schools.com"
elementFormDefault="qualified">
...
...
</xs:schema> 

Example from https://www.w3schools.com/xml/schema_schema.asp
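
A minimal Python sketch of that check, assuming the file is UTF-8 compatible and that the first 64 KB covers the prolog and the root start tag. The function name `find_schema_hints` and the two extra markers (`schemaLocation=` and `noNamespaceSchemaLocation=`, the standard `xsi` attributes an instance document can use to point at its schema) are additions for illustration, not part of the suggestion above. It reads a fixed number of bytes rather than lines, because a huge XML file may contain no line breaks at all.

```python
def find_schema_hints(path, head_bytes=64 * 1024,
                      markers=("<xs:schema ",
                               "schemaLocation=",
                               "noNamespaceSchemaLocation=")):
    """Return the markers that occur in the first head_bytes of the file."""
    # Read only the head of the file; the rest of the 60 GB is never touched.
    with open(path, "rb") as f:
        head = f.read(head_bytes).decode("utf-8", errors="replace")
    return [marker for marker in markers if marker in head]
```

An empty result only means that no hint was found in the head of the file; the structure then has to be inferred from the data itself.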

GavinBrelstaff

I suggest that you start by looking at the XML document itself. Take the first megabyte of the document as an initial sample, add the missing end tags, and load it into an XML editor. Consider how repetitive the data is.
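
One way to cut such a sample without closing the tags by hand is sketched below in Python. It is only an illustration: the function name `well_formed_head` and the 1,000,000-byte default are assumptions. It feeds the truncated head to the standard library's `XMLPullParser` and serializes the partial tree, which closes every element that was still open at the cut.

```python
import xml.etree.ElementTree as ET

def well_formed_head(path, head_bytes=1_000_000):
    """Return the first head_bytes of an XML file as a well-formed string,
    with every element that was still open at the cut closed."""
    with open(path, "rb") as f:
        head = f.read(head_bytes)

    parser = ET.XMLPullParser(events=("start",))
    parser.feed(head)              # no close(): the input is deliberately cut short
    if hasattr(parser, "flush"):   # only available on newer Python versions
        parser.flush()

    # The first "start" event carries the root element; the partial tree
    # parsed so far hangs off it.
    for _event, root in parser.read_events():
        return ET.tostring(root, encoding="unicode")
    raise ValueError("no complete start tag found in the sample")
```

Note that ElementTree drops comments and rewrites namespace prefixes (`ns0`, `ns1`, ...) when serializing, so the result is a structural sample rather than a byte-for-byte copy of the head.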

Then use an online schema generator, or search for a suitable library, to generate a sample XML Schema from it. Load that XML Schema into a streaming validator, such as ValidatorHandler (javax.xml.validation) in Java, and attempt to validate the whole document.

Do a few iterations of manually adding any 'offending' XML fragments to the initial sample and regenerating the XML Schema. If you still cannot get the whole 60 GB document to validate, write a tool that splits the document into suitable chunks, say 20-100 MB each, in a streaming fashion (in memory). Then feed each chunk into a schema generator and collect the distinct XML Schema variants together with the corresponding sample XML. In other words, if the resulting XML Schemas for chunks 3 and 4 are identical, keep only chunk 3.
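
The chunk-and-deduplicate idea can be sketched in Python as follows. This is a simplified stand-in, not a real schema generator: instead of an XML Schema, each chunk gets a "signature", the set of element paths with their attribute names. It also assumes that the records are direct children of the root element and defines a chunk by record count rather than by size in bytes. The names `signature` and `distinct_chunks` are hypothetical.

```python
import xml.etree.ElementTree as ET

def signature(elem, prefix=""):
    """Set of (element path, sorted attribute names) pairs under elem."""
    path = f"{prefix}/{elem.tag}"
    found = {(path, tuple(sorted(elem.attrib)))}
    for child in elem:
        found |= signature(child, path)
    return found

def distinct_chunks(source, records_per_chunk=1000):
    """Stream source and map each distinct chunk signature to the index of
    the first chunk that produced it. A chunk is records_per_chunk
    consecutive children of the root element."""
    first_seen = {}
    current, count, chunk_index = set(), 0, 0
    depth, root = 0, None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem
            continue
        depth -= 1
        if depth != 1:
            continue                  # not a finished direct child of the root
        current |= signature(elem)
        root.remove(elem)             # drop the record so memory stays flat
        count += 1
        if count == records_per_chunk:
            first_seen.setdefault(frozenset(current), chunk_index)
            current, count = set(), 0
            chunk_index += 1
    if count:                         # trailing partial chunk
        first_seen.setdefault(frozenset(current), chunk_index)
    return first_seen
```

The returned chunk indexes identify which chunks are worth keeping as samples. The union of all signatures lists every element path and attribute name seen in the file, which is close to what the question asks for.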

You might want to normalize the generated XML Schemas so that variations in basic types are ignored at first. Whether this is needed depends on the XML Schema generator.

Hopefully this will reduce the sample to a much smaller set of files, which you can combine into a new sample. Then repeat the process, splitting into smaller chunks and looking for unique XML Schemas.

ThomasRS

using System;
using System.IO.Compression;
using System.Linq;
using System.Xml;
using System.Xml.Schema;

// 'file' is assumed to be a FileInfo for a zip archive whose first entry is the XML document.
using (var zipArchive = ZipFile.Open(file.FullName, ZipArchiveMode.Read))
using (var reader = XmlReader.Create(zipArchive.Entries.First().Open()))
{
    // Infer a schema from the XML as it is read forward-only through the XmlReader.
    XmlSchemaInference inference = new XmlSchemaInference();
    XmlSchemaSet schemaSet = inference.InferSchema(reader);

    // Display the inferred schema.
    Console.WriteLine("Original schema:\n");
    foreach (XmlSchema schema in schemaSet.Schemas())
    {
        schema.Write(Console.Out);
        // or save it to a file
    }
}