I want to map structured data (Microdata, JSON-LD) extracted from HTML text into a Java POJO. For extraction I use the Apache Any23 library and configured a JSONLDWriter to convert the structured data found in the HTML document into JSON-LD format.

This works as expected and gives me the following output:

[ {
  "@graph" : [ {
    "@id" : "_:node1gn1v4pudx1",
    "@type" : [ "http://schema.org/JobPosting" ],
    "http://schema.org/datePosted" : [ {
      "@language" : "en-us",
      "@value" : "Wed Jan 11 02:00:00 UTC 2023"
    } ],
    "http://schema.org/description" : [ {
      "@language" : "en-us",
      "@value" : "Comprehensive Job Description"
    } ],
    "http://schema.org/hiringOrganization" : [ {
      "@language" : "en-us",
      "@value" : "Org AG"
    } ],
    "http://schema.org/jobLocation" : [ {
      "@id" : "_:node1gn1v4pudx2"
    } ],
    "http://schema.org/title" : [ {
      "@language" : "en-us",
      "@value" : "Recruiter (m/f/d)\n    "
    } ]
  }, {
    "@id" : "_:node1gn1v4pudx2",
    "@type" : [ "http://schema.org/Place" ],
    "http://schema.org/address" : [ {
      "@id" : "_:node1gn1v4pudx3"
    } ]
  }, {
    "@id" : "_:node1gn1v4pudx3",
    "@type" : [ "http://schema.org/PostalAddress" ],
    "http://schema.org/addressCountry" : [ {
      "@language" : "en-us",
      "@value" : "Company Country"
    } ],
    "http://schema.org/addressLocality" : [ {
      "@language" : "en-us",
      "@value" : "Company City"
    } ],
    "http://schema.org/addressRegion" : [ {
      "@language" : "en-us",
      "@value" : "Company Region"
    } ]
  }, {
    "@id" : "https://career.company.com/job/Recruiter/",
    "http://www.w3.org/1999/xhtml/microdata#item" : [ {
      "@id" : "_:node1gn1v4pudx1"
    } ]
  } ],
  "@id" : "https://career.company.com/job/Recruiter/"
} ]

Next I want to deserialize the JSON-LD object into a Java bean using Jackson. The POJO class should look something like this:

public class JobPosting {
    private String datePosted;
    private String hiringOrganization;
    private String title;
    private String description;

    // Following members could be enclosed in a class too if easier
    // Like class Place{private PostalAddress postalAddress;}
    // private Place place;
    private String addressCountry;
    private String addressLocality;
    private String addressRegion;
}

I would like to do it with the annotations provided by Jackson, but I struggle with a few things:

  • The @type value is wrapped in an array node
  • The actual data has an extra @value layer
  • Some objects only hold a reference to other objects in the graph via @id fields

How can I map these fields to my Java POJO properly?

wero026

3 Answers

1

The trick is to run the JSON-LD through a JSON-LD processor first to get a more developer-friendly JSON. The titanium-json-ld library provides such processors.

JsonDocument input = JsonDocument.of(jsonLdAsInputStream);
JsonObject frame = JsonLd.frame(input, URI.create("http://schema.org")).get();

The snippet above resolves the references made via @id and shortens the JSON keys using the given IRI.
That leads to the following output, which is easy to parse with Jackson:

[{
  "id": "_:b0",
  "type": "JobPosting",
  "datePosted": {
    "@language": "en-us",
    "@value": "Wed Jan 11 02:00:00 UTC 2023"
  },
  "description": {
    "@language": "en-us",
    "@value": "Comprehensive Job Description"
  },
  "hiringOrganization": {
    "@language": "en-us",
    "@value": "Org AG"
  },
  "jobLocation": {
    "id": "_:b1",
    "type": "Place",
    "address": {
      "id": "_:b2",
      "type": "PostalAddress",
      "addressCountry": {
        "@language": "en-us",
        "@value": "Company Country"
      },
      "addressLocality": {
        "@language": "en-us",
        "@value": "Company City"
      },
      "addressRegion": {
        "@language": "en-us",
        "@value": "Company Region"
      }
    }
  },
  "title": {
    "@language": "en-us",
    "@value": "Recruiter (m/f/d)\n    "
  }
}]
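
The framed output above still wraps every literal in a {"@language", "@value"} object. A minimal sketch of one way to strip that layer is shown below. It assumes the framed JSON has already been read into plain Map/List objects (the shape Jackson produces when reading into Object.class) and that Java 16+ is available for pattern matching; the class name ValueUnwrapper and the hand-built sample data are hypothetical, not part of titanium-json-ld or Jackson.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ValueUnwrapper {

  // Recursively replaces every object that has an "@value" key with that value.
  static Object unwrap(Object node) {
    if (node instanceof Map<?, ?> map) {
      if (map.containsKey("@value")) {
        return map.get("@value");
      }
      Map<String, Object> result = new LinkedHashMap<>();
      for (Map.Entry<?, ?> entry : map.entrySet()) {
        result.put(String.valueOf(entry.getKey()), unwrap(entry.getValue()));
      }
      return result;
    }
    if (node instanceof List<?> list) {
      List<Object> result = new ArrayList<>();
      for (Object item : list) {
        result.add(unwrap(item));
      }
      return result;
    }
    return node; // strings, numbers, booleans and null stay as they are
  }

  public static void main(String[] args) {
    // Hand-built stand-in for a fragment of the framed output.
    Map<String, Object> title = Map.of("@language", "en-us", "@value", "Recruiter (m/f/d)");
    Map<String, Object> country = Map.of("@language", "en-us", "@value", "Company Country");
    Map<String, Object> address = Map.of("type", "PostalAddress", "addressCountry", country);
    Map<String, Object> place = Map.of("type", "Place", "address", address);
    Map<String, Object> posting = Map.of("type", "JobPosting", "title", title, "jobLocation", place);

    System.out.println(unwrap(List.of(posting)));
  }
}
```

The unwrapped tree should then be simple enough to hand to Jackson, for example via ObjectMapper.convertValue, with classes whose property names match the shortened keys.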
wero026
0

The values you are interested in (for example "datePosted" and "hiringOrganization") are always stored under "@value" inside the array named after the property (in this case "http://schema.org/datePosted" and "http://schema.org/hiringOrganization"). They all live in one part of your JSON, which can be read as a JsonNode in the following way:

JsonNode root = mapper.readTree(json)
                      .get(0)
                      .get("@graph")
                      .get(0);

So if you have a pojo like below:

@Data
public class JobPosting {

    private String datePosted;
    private String hiringOrganization;
}

and you want to retrieve the datePosted and hiringOrganization values, note that each value sits at the same relative position in the JSON, so you can collect them in a for loop:

JsonNode root = mapper.readTree(json)
                      .get(0)
                      .get("@graph")
                      .get(0);

String strSchema = "http://schema.org/";
String[] fieldNames = {"datePosted", "hiringOrganization"};
// creating a Map<String, String> that will be converted to the JobPosting obj
Map<String, String> map = new HashMap<>();
for (String fieldName : fieldNames) {
    map.put(fieldName,
            root.get(strSchema + fieldName)
                .get(0)
                .get("@value")
                .asText()
    );
}

JobPosting jobPosting = mapper.convertValue(map, JobPosting.class);
// it prints JobPosting(datePosted=Wed Jan 11 02:00:00 UTC 2023, hiringOrganization=Org AG)
System.out.println(jobPosting);
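
The loop above assumes every property is present in the node; a missing one would end in a NullPointerException. Below is a rough, null-safe sketch of the same lookup. It works on plain Map/List objects (the shape Jackson yields when reading into Object.class) rather than on JsonNode, so that it runs with the JDK alone; the class GraphValues and the method firstValue are hypothetical names, and Java 16+ is assumed for pattern matching.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class GraphValues {

  private static final String SCHEMA = "http://schema.org/";

  // Returns the first "@value" stored under the given schema.org property, if any.
  static Optional<String> firstValue(Map<String, Object> node, String fieldName) {
    Object values = node.get(SCHEMA + fieldName);
    if (values instanceof List<?> list
        && !list.isEmpty()
        && list.get(0) instanceof Map<?, ?> first
        && first.get("@value") instanceof String text) {
      return Optional.of(text.strip()); // also drops the trailing "\n    " seen in the title
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Hand-built stand-in for the first node of "@graph".
    Map<String, Object> node = Map.of(
        SCHEMA + "title",
        List.of(Map.of("@language", "en-us", "@value", "Recruiter (m/f/d)\n    ")));

    System.out.println(firstValue(node, "title"));      // Optional[Recruiter (m/f/d)]
    System.out.println(firstValue(node, "datePosted")); // Optional.empty
  }
}
```

Returning an Optional leaves it to the caller to decide whether a missing property is an error or simply stays unset in the POJO.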
dariosicily
  • That plays out for fields directly defined in the JobPosting object, but not for fields linked to another object via id. Check out the jobLocation field, it is connected to the `"http://schema.org/Place"` object which is again connected to the `"http://schema.org/PostalAddress"` object which holds the required values. – wero026 Jan 23 '23 at 05:25
0

This would require some preprocessing first to turn your graph with id pointers into a simplified tree before mapping it with Jackson:

  1. Turn it into a tree by replacing the @id references with the actual objects themselves.
  2. Flatten those troublesome object/array wrappers around @value.

Full code below, using Java 17 and a bit of recursion:

package org.example;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import static java.util.stream.Collectors.toMap;

class Main {

  public static void main(String[] args) throws Exception {
    var mapper = new ObjectMapper();
    var node = mapper.readValue(new File("test.json"), Object.class);

    // Build a lookup map of "@id" to the actual object.
    var lookup = buildLookup(node, new HashMap<>());

    // Replace "@id" references with the actual objects themselves instead
    var referenced = lookupReferences(node, lookup);

    // Flatten single-element arrays that wrap an "@value" down to the "@value" itself
    var flattened = flatten(referenced);

    // Jackson should be able to understand our objects at this point, so convert them
    var jobPostings =
        mapper.convertValue(flattened, new TypeReference<List<RootObject>>() {}).stream()
            .flatMap(it -> it.graph().stream())
            .filter(it -> it instanceof JobPosting)
            .map(it -> (JobPosting) it)
            .toList();

    System.out.println(jobPostings);
  }

  private static Map<String, Object> buildLookup(Object node, Map<String, Object> lookup) {
    if (node instanceof List<?> list) {
      for (var value : list) {
        buildLookup(value, lookup);
      }
    } else if (node instanceof Map<?, ?> map) {
      for (var value : map.values()) {
        buildLookup(value, lookup);
      }
      if (map.size() > 1 && map.get("@id") instanceof String id) {
        lookup.put(id, node);
      }
    }
    return lookup;
  }

  private static Object lookupReferences(Object node, Map<String, Object> lookup) {
    if (node instanceof List<?> list
        && list.size() == 1
        && list.get(0) instanceof Map<?, ?> map
        && map.size() == 1
        && map.get("@id") instanceof String id) {
      return lookupReferences(lookup.get(id), lookup);
    }

    if (node instanceof List<?> list) {
      return list.stream().map(value -> lookupReferences(value, lookup)).toList();
    }

    if (node instanceof Map<?, ?> map) {
      return map.entrySet().stream()
          .map(entry -> Map.entry(entry.getKey(), lookupReferences(entry.getValue(), lookup)))
          .collect(toMap(Entry::getKey, Entry::getValue));
    }

    return node;
  }

  private static Object flatten(Object node) {
    if (node instanceof List<?> list && list.size() == 1) {
      if (list.get(0) instanceof String s) {
        return s;
      }
      if (list.get(0) instanceof Map<?, ?> map) {
        var value = map.get("@value");
        if (value != null) {
          return value;
        }
      }
    }

    if (node instanceof List<?> list) {
      return list.stream().map(Main::flatten).toList();
    }

    if (node instanceof Map<?, ?> map) {
      return map.entrySet().stream()
          .map(entry -> Map.entry(entry.getKey(), flatten(entry.getValue())))
          .collect(toMap(Entry::getKey, Entry::getValue));
    }

    return node;
  }
}

@JsonIgnoreProperties(ignoreUnknown = true)
record RootObject(@JsonProperty("@graph") List<GraphObject> graph) {}

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "@type", defaultImpl = Ignored.class)
@JsonSubTypes({
  @JsonSubTypes.Type(value = JobPosting.class, name = "http://schema.org/JobPosting"),
  @JsonSubTypes.Type(value = Place.class, name = "http://schema.org/Place"),
  @JsonSubTypes.Type(value = PostalAddress.class, name = "http://schema.org/PostalAddress"),
})
interface GraphObject {}

@JsonIgnoreProperties(ignoreUnknown = true)
record Ignored() implements GraphObject {}

@JsonIgnoreProperties(ignoreUnknown = true)
record JobPosting(
    @JsonProperty("http://schema.org/title") String title,
    @JsonProperty("http://schema.org/description") String description,
    @JsonProperty("http://schema.org/hiringOrganization") String hiringOrganization,
    @JsonProperty("http://schema.org/datePosted") String datePosted,
    @JsonProperty("http://schema.org/jobLocation") Place jobLocation)
    implements GraphObject {}

@JsonIgnoreProperties(ignoreUnknown = true)
record Place(@JsonProperty("http://schema.org/address") PostalAddress address)
    implements GraphObject {}

@JsonIgnoreProperties(ignoreUnknown = true)
record PostalAddress(
    @JsonProperty("http://schema.org/addressLocality") String locality,
    @JsonProperty("http://schema.org/addressRegion") String region,
    @JsonProperty("http://schema.org/addressCountry") String country)
    implements GraphObject {}
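
If you prefer the flat shape from the question over nested records, the result can be collapsed in a separate step. The sketch below is one possible way to do that. FlatJobPosting and FlattenDemo are hypothetical names, and the three other records are simplified stand-ins for the ones above (same components, no Jackson annotations) so that the snippet compiles on its own.

```java
import java.util.Optional;

// Simplified stand-ins for the records above, without the Jackson annotations.
record PostalAddress(String locality, String region, String country) {}

record Place(PostalAddress address) {}

record JobPosting(
    String title, String description, String hiringOrganization,
    String datePosted, Place jobLocation) {}

// Hypothetical flat view with the fields asked for in the question.
record FlatJobPosting(
    String datePosted, String hiringOrganization, String title, String description,
    String addressCountry, String addressLocality, String addressRegion) {

  static FlatJobPosting from(JobPosting posting) {
    // jobLocation or its address may be absent, so walk the chain null-safely.
    Optional<PostalAddress> address =
        Optional.ofNullable(posting.jobLocation()).map(Place::address);
    return new FlatJobPosting(
        posting.datePosted(),
        posting.hiringOrganization(),
        posting.title() == null ? null : posting.title().strip(),
        posting.description(),
        address.map(PostalAddress::country).orElse(null),
        address.map(PostalAddress::locality).orElse(null),
        address.map(PostalAddress::region).orElse(null));
  }
}

final class FlattenDemo {
  public static void main(String[] args) {
    var posting = new JobPosting(
        "Recruiter (m/f/d)\n    ",
        "Comprehensive Job Description",
        "Org AG",
        "Wed Jan 11 02:00:00 UTC 2023",
        new Place(new PostalAddress("Company City", "Company Region", "Company Country")));

    System.out.println(FlatJobPosting.from(posting));
  }
}
```

Keeping the Jackson-facing records and the flat view separate means the mapping annotations stay close to the JSON-LD structure, while the rest of the application only sees the flat type.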

Lae
  • Giving you the bounty for the best answer provided within the bounty time range. And I see that you put a lot of effort into it. – wero026 Jan 27 '23 at 09:53