11

I need to create an Avro file, and for that I need two things:

1) JSON

2) Avro schema

Of these two, I already have the JSON:

{"web-app": {
  "servlet": [   
    {
      "servlet-name": "cofaxCDS",
      "servlet-class": "org.cofax.cds.CDSServlet",
      "init-param": {
        "configGlossary:installationAt": "Philadelphia, PA",
        "configGlossary:adminEmail": "ksm@pobox.com",
        "configGlossary:poweredBy": "Cofax",
        "configGlossary:poweredByIcon": "/images/cofax.gif",
        "configGlossary:staticPath": "/content/static",
        "templateProcessorClass": "org.cofax.WysiwygTemplate",
        "templateLoaderClass": "org.cofax.FilesTemplateLoader",
        "templatePath": "templates",
        "templateOverridePath": "",
        "defaultListTemplate": "listTemplate.htm",
        "defaultFileTemplate": "articleTemplate.htm",
        "useJSP": false,
        "jspListTemplate": "listTemplate.jsp",
        "jspFileTemplate": "articleTemplate.jsp",
        "cachePackageTagsTrack": 200,
        "cachePackageTagsStore": 200,
        "cachePackageTagsRefresh": 60,
        "cacheTemplatesTrack": 100,
        "cacheTemplatesStore": 50,
        "cacheTemplatesRefresh": 15,
        "cachePagesTrack": 200,
        "cachePagesStore": 100,
        "cachePagesRefresh": 10,
        "cachePagesDirtyRead": 10,
        "searchEngineListTemplate": "forSearchEnginesList.htm",
        "searchEngineFileTemplate": "forSearchEngines.htm",
        "searchEngineRobotsDb": "WEB-INF/robots.db",
        "useDataStore": true,
        "dataStoreClass": "org.cofax.SqlDataStore",
        "redirectionClass": "org.cofax.SqlRedirection",
        "dataStoreName": "cofax",
        "dataStoreDriver": "com.microsoft.jdbc.sqlserver.SQLServerDriver",
        "dataStoreUrl": "jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon",
        "dataStoreUser": "sa",
        "dataStorePassword": "dataStoreTestQuery",
        "dataStoreTestQuery": "SET NOCOUNT ON;select test='test';",
        "dataStoreLogFile": "/usr/local/tomcat/logs/datastore.log",
        "dataStoreInitConns": 10,
        "dataStoreMaxConns": 100,
        "dataStoreConnUsageLimit": 100,
        "dataStoreLogLevel": "debug",
        "maxUrlLength": 500}},
    {
      "servlet-name": "cofaxEmail",
      "servlet-class": "org.cofax.cds.EmailServlet",
      "init-param": {
      "mailHost": "mail1",
      "mailHostOverride": "mail2"}},
    {
      "servlet-name": "cofaxAdmin",
      "servlet-class": "org.cofax.cds.AdminServlet"},

    {
      "servlet-name": "fileServlet",
      "servlet-class": "org.cofax.cds.FileServlet"},
    {
      "servlet-name": "cofaxTools",
      "servlet-class": "org.cofax.cms.CofaxToolsServlet",
      "init-param": {
        "templatePath": "toolstemplates/",
        "log": 1,
        "logLocation": "/usr/local/tomcat/logs/CofaxTools.log",
        "logMaxSize": "",
        "dataLog": 1,
        "dataLogLocation": "/usr/local/tomcat/logs/dataLog.log",
        "dataLogMaxSize": "",
        "removePageCache": "/content/admin/remove?cache=pages&id=",
        "removeTemplateCache": "/content/admin/remove?cache=templates&id=",
        "fileTransferFolder": "/usr/local/tomcat/webapps/content/fileTransferFolder",
        "lookInContext": 1,
        "adminGroupID": 4,
        "betaServer": true}}],
  "servlet-mapping": {
    "cofaxCDS": "/",
    "cofaxEmail": "/cofaxutil/aemail/*",
    "cofaxAdmin": "/admin/*",
    "fileServlet": "/static/*",
    "cofaxTools": "/tools/*"},

  "taglib": {
    "taglib-uri": "cofax.tld",
    "taglib-location": "/WEB-INF/tlds/cofax.tld"}}}

But how do I create an Avro schema based on it?

I'm looking for a programmatic way to do this, since I will have many schemas and cannot create an Avro schema manually every time.

I checked 'avro-tools-1.8.1.jar', but it cannot create an Avro schema from JSON directly.

I'm looking for a JAR or Python code that can generate an Avro schema from JSON. It is OK if the data types are not perfect (strings, integers and floats are good enough for a start).

Cœur
Joe
  • JSON is basically schemaless. What is the source of this JSON? – Elliott Frisch Oct 04 '17 at 03:38
  • Thanks. I do understand that JSON is schemaless. However, in my project - different customers have JSON and they send me as that. There will be many different JSONs - above is just 1 example. I do not have ability to force them to create AVRO but AVRO format is required for my project. I have 2 options: 1) Manually create with every customer AVRO schema for every JSON and 2) Try to use some code to automate creating AVRO schema based on JSON (even if is not perfect). Looking for option 2. Thanks. – Joe Oct 04 '17 at 03:46
  • Store it as a `String`. – Elliott Frisch Oct 04 '17 at 03:48
  • I can not use String. AVRO format is required by a project and String is not accepted. – Joe Oct 04 '17 at 03:49

4 Answers

12

This one works well: paste in your JSON and copy out the generated Avro schema.

https://toolslick.com/generation/metadata/avro-schema-from-json

Codex
  • This saved the day for me! Awesome. – Susheel Javadi May 01 '20 at 16:12
  • This works. However. If your json has spaces in the key fields, it will include spaces in the avro names, which is invalid. You will need to take the spaces out. Add underscores or something. otherwise, it seems to work. If nothing else, it is a good way to see how something complicated might be formatted in avro. – ajpieri Mar 26 '21 at 17:00
  • @ajpieri The tool can now handle spaces in JSON property names by replacing it with another character such as underscore. – Partho Mar 05 '22 at 20:09
  • This is not a free service anymore. – vlyalcin Dec 12 '22 at 18:41
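The point about spaces in key names applies to the JSON in the question too: keys such as "servlet-name" and "configGlossary:adminEmail" are not valid Avro names, which must start with a letter or underscore and contain only letters, digits and underscores. A minimal sketch of a renaming step follows; `sanitize_avro_name` is a hypothetical helper written for illustration, not part of any tool mentioned here.

```python
import re

def sanitize_avro_name(key: str) -> str:
    """Turn an arbitrary JSON key into a valid Avro name (hypothetical helper)."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    name = re.sub(r"[^A-Za-z0-9_]", "_", key)
    # Avro names must be non-empty and must not start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name

print(sanitize_avro_name("servlet-name"))               # servlet_name
print(sanitize_avro_name("configGlossary:adminEmail"))  # configGlossary_adminEmail
```

Note that two different keys can collapse to the same name (for example "a-b" and "a_b"), so a real converter would also need to detect collisions.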
11

You can use the Kite SDK's JsonUtil to infer an Avro schema from JSON input.

https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/JsonUtil.java#L539

Example:

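    import org.kitesdk.data.spi.JsonUtil;

    // Sample JSON document to infer a schema from
    String json = "{\n" +
            "    \"id\": 1,\n" +
            "    \"name\": \"A green door\",\n" +
            "    \"price\": 12.50,\n" +
            "    \"tags\": [\"home\", \"green\"]\n" +
            "}\n";
    // Parse the JSON, infer a record schema named "myschema", and print it
    String avroSchema = JsonUtil.inferSchema(JsonUtil.parse(json), "myschema").toString();
    System.out.println(avroSchema);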

Result:

{  
   "type":"record",
   "name":"myschema",
   "fields":[  
      {  
         "name":"id",
         "type":"int",
         "doc":"Type inferred from '1'"
      },
      {  
         "name":"name",
         "type":"string",
         "doc":"Type inferred from '\"A green door\"'"
      },
      {  
         "name":"price",
         "type":"double",
         "doc":"Type inferred from '12.5'"
      },
      {  
         "name":"tags",
         "type":{  
            "type":"array",
            "items":"string"
         },
         "doc":"Type inferred from '[\"home\",\"green\"]'"
      }
   ]
}

The class is part of the kite-data-core module (see the path in the link above), so that is the Maven dependency to add.
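Since the question also asks for Python, the same kind of inference can be sketched with the standard library alone. This is a rough illustration, not a port of Kite's code: `infer_schema` is a hypothetical function, integers are mapped to "long" rather than "int", arrays are assumed to be homogeneous, and JSON keys are used verbatim as field names (so keys that are not valid Avro names would still need renaming).

```python
import json

def infer_schema(value, name):
    """Return an Avro schema (as plain Python data) that accepts `value`."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        # Assumption: the first item decides the item type;
        # an empty array falls back to "string".
        items = infer_schema(value[0], name + "_item") if value else "string"
        return {"type": "array", "items": items}
    if isinstance(value, dict):
        # Nested records get a name derived from their path to keep names unique.
        fields = [
            {"name": key, "type": infer_schema(child, name + "_" + key)}
            for key, child in value.items()
        ]
        return {"type": "record", "name": name, "fields": fields}
    raise TypeError(f"unsupported JSON value: {value!r}")

doc = json.loads(
    '{"id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"]}'
)
print(json.dumps(infer_schema(doc, "myschema"), indent=2))
```

Nested objects become nested records, so the same function handles deeper documents, within the assumptions listed above.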

hlagos
6

Give this one a shot. http://www.dataedu.ca/avro

It infers an Avro schema that accepts the given JSON.

You can even give it a JSON array. In that case it generates an Avro schema that is compatible with all the JSON documents in the array.

There are other tools you can use to verify the result.
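To illustrate what "compatible with all the documents" can mean, here is a minimal Python sketch of one possible merging strategy. It is not necessarily what the tool above does, and `merge_schemas` and `primitive_type` are hypothetical names. It handles only flat objects with primitive values: each field's type is the union of the types seen, and a field missing from any document becomes nullable with a null default.

```python
def primitive_type(value):
    """Map a JSON primitive to an Avro type name."""
    # bool first, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    raise TypeError("this sketch handles flat documents only")

def merge_schemas(docs, name):
    """Build one record schema that accepts every flat document in `docs`."""
    seen = {}  # field name -> Avro types observed, in first-seen order
    for doc in docs:
        for key, value in doc.items():
            types = seen.setdefault(key, [])
            avro_type = primitive_type(value)
            if avro_type not in types:
                types.append(avro_type)
    fields = []
    for key, types in seen.items():
        nullable = "null" in types or any(key not in doc for doc in docs)
        non_null = [t for t in types if t != "null"]
        if nullable:
            # "null" goes first so that the null default is valid for the union.
            fields.append({"name": key, "type": ["null"] + non_null, "default": None})
        else:
            fields.append({"name": key, "type": non_null[0] if len(non_null) == 1 else non_null})
    return {"type": "record", "name": name, "fields": fields}

docs = [{"id": 1, "name": "a"}, {"id": 2, "price": 9.5}]
print(merge_schemas(docs, "merged"))
```

Here "id" stays a plain "long" because every document has it, while "name" and "price" become optional.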

Iraj Hedayati
0

If you want to avoid creating a dedicated Avro schema for every JSON format, you can use the rec-avro package.

It lets you take any Python data structure, including parsed XML or JSON, and store it in Avro without needing a dedicated schema.

I tested it with Python 3.

You can install it with pip3 install rec-avro, or see the code and docs at https://github.com/bmizhen/rec-avro

I posted a JSON-to-Avro code example here: https://stackoverflow.com/a/55444481/6654219

boriska