
I just found out about Schema.org. I would like to use it in my webpages. I think I have gathered a very basic and confusing idea of Schema.org so far, but unfortunately right now I can’t afford the time to dive deep into it and learn more to be able to use it properly and immediately in the pages I am building right now.

So, here is my problem:

I converted a huge 670-page book (with quite a few pictures in addition to text) into HTML5 pages. The book is a PDF file. I broke it down into 23 chunks and then converted those chunks into an equal number of HTML5 files, using a free/trial converter (PDF to HTML5+SVG). These HTML5 files don’t have any visible dependencies or external assets like normal HTML pages have (images, JS, CSS, etc.). Also, in addition to the images from the original PDF, the original text has been converted into SVG instead of text and embedded or encoded into the HTML files, I think. I don’t see any external dependent files; the files seem to be self-contained, consisting of lots of code only. In other words, the entire content of the book seems to be inside those HTML files. I am not familiar with such HTML files and am not sure whether this is possible or whether I am missing something due to my lack of knowledge.

Anyway, now inside the source codes of those HTML files, I would like to tell the search engines (and other concerned parties, if any) in a Google-friendly manner as far as possible, using Microdata or JSON-LD, that —

  1. This file (the individual HTML5 file chunks) is a part or chunk (not necessarily a ‘chapter’) of (isPartOf? PublicationIssue?) a “Book” or “EBook” (of the same book or ebook). There are other similar files here too, and together they make the entire book.

  2. The main content of the book (therefore of the individual HTML files) is mostly in image format, probably SVG+XML. -- bookFormat / BookFormatType / ImageObject/ associatedMedia / MediaObject / encoding / encodesCreativeWork / encodingFormat? (Although, my understanding was that the converter is supposed to add an extracted text file or just extracted text to facilitate search, but I can’t find that.)

  3. Add: numberOfPages of the entire book (not of the individual chunks or html files), about, sameAs (for main site), description.

My problem is that I am not sure, based on my present knowledge, which Schema.org types and properties to choose for the context described above, how to write them concisely with valid syntax, and where to place them inside the source code of the HTML files. The content of the files looks to me like jumbled, almost undecipherable code with a bit of the original text sprinkled very sparsely here and there. It looks as if all fonts, text and images of the original are encoded in the same place, and they are almost indistinguishable to me. So my idea is to start Microdata at the body tag and wrap everything else inside one or two div or span elements. There is no need to identify items separately.

That’s it! Can anybody help?

UPDATE BASED ON UNOR'S REPLY

Here is the code I think I will settle on (some questions remain):

  1. To be placed in the Table of Contents page of the book/ebook (with the title of the book as header), which will also be the entry page:

    <script type="application/ld+json">
    {
      "@context": "http://schema.org/",
      "@id": "http://example.com/Archaeological_Heritage_Of_India.html#book",
      "@type": "Book",
      "name": "Archaeological Heritage of India",
      "bookFormat": {"@id": "http://schema.org/EBook"},
      "inLanguage": "en",
      "genre": "Archaeological Heritage"
    }
    </script>

     (Or should genre be the URI "http://vocab.getty.edu/aat/300054328" instead?)

  2. To be placed in the rest of the pages of the book (i.e., the separate individual HTML files):

    <script type="application/ld+json">
      {
        "@context":  "http://schema.org/",
        "isPartOf": "http://example.com/Archaeological_Heritage_Of_India.html#book"
      }
    </script>
    

What I would like to know is whether this is completely correct.

Also, how can I (and should I) incorporate contentLocation in no. 1 to indicate the geographical limit or focus of the main content of the book? How about the following:

"contentLocation": "India" (or the ISO 3166-1 alpha-2 country code, "IN"?)
Mon
  • @unor Hi, can you help? – Mon Aug 05 '16 at 02:10
  • Please provide an MCV example: http://stackoverflow.com/help/mcve – Jay Gray Aug 05 '16 at 10:12
  • @JayG Here is an example: https://dl.dropboxusercontent.com/s/8rjoy9r5cdsa04h/jPDF_Web_Example_for_STACK.html?dl=0 . By the way, are files in Dropbox crawled / indexed by Google? – Mon Aug 05 '16 at 11:11
  • If I decide to upload my files to Dropbox, instead of my website, and iFrame the files from my site - will they be indexed by Google as part of my site? – Mon Aug 05 '16 at 11:17
  • I think your question is too broad. I posted an answer that gives broad entry points, but for your specific questions you should make separate question posts. – unor Aug 05 '16 at 14:00
  • @Mon - no, Dropbox files are not Google accessible. I have checked this several times with Dropbox to verify. – Jay Gray Aug 05 '16 at 18:12

1 Answer


Syntax

If these are HTML5 documents, you have three options to provide structured data using Schema.org:

  • JSON-LD
  • Microdata
  • RDFa

While Microdata and RDFa define attributes that get added to your existing HTML elements, JSON-LD gets added in a separate script element.

Just because it represents a book (instead of a "normal" website) doesn’t change how JSON-LD/Microdata/RDFa can be added. Choose whatever syntax works best for you.
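
As a rough illustration of the difference, the sketch below adds Microdata attributes to existing elements instead of using a separate script element. The title "Foobar", the wrapping div, and the choice of the Book type are placeholders for illustration, not taken from your files:

```html
<!-- Microdata sketch: the attributes sit on the page's own elements -->
<body itemscope itemtype="http://schema.org/Book">
  <!-- "Foobar" is a placeholder title -->
  <h1 itemprop="name">Foobar</h1>
  <!-- a link element with itemprop may be used for URL values -->
  <link itemprop="bookFormat" href="http://schema.org/EBook">
  <div>
    <!-- the converter output (SVG pages) would stay untouched here -->
  </div>
</body>
```

With JSON-LD, none of the existing elements need to be touched, which may be the easier route if the generated markup is hard to read.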

Vocabulary

For the whole book, you should use the Book type. EBook is not a type, but an enumeration value for the bookFormat property.

So you could have (example in JSON-LD):

<script type="application/ld+json">
{
  "@context":  "http://schema.org/",
  "@id": "http://example.com/foobar#book",
  "@type": "Book",
  "name": "Foobar",
  "bookFormat": {"@id": "http://schema.org/EBook"}
}
</script>

The URI in the first @id (http://example.com/foobar#book) would be the URI that represents the book. I added the #book fragment to differentiate between the actual book and the webpage that contains (or is about) the book (details). If you have a separate website for this book, it would make sense to use the website’s homepage URI (ideally with a fragment, like #book or something else).
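
Further properties can be added to the same item. The following is only a sketch under assumptions: 670 is the page count mentioned in the question, while the description text, the about value, and the sameAs URL are hypothetical placeholders. contentLocation expects a Place (Country is one of its subtypes), so giving it an item with a name is probably clearer than a bare string or country code:

```html
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@id": "http://example.com/foobar#book",
  "@type": "Book",
  "name": "Foobar",
  "bookFormat": {"@id": "http://schema.org/EBook"},
  "numberOfPages": 670,
  "description": "Placeholder description of the book.",
  "about": {"@type": "Thing", "name": "Placeholder topic"},
  "contentLocation": {"@type": "Country", "name": "India"},
  "sameAs": "http://example.com/"
}
</script>
```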

Whenever you reference this book, you may use this URI instead of repeating the data on each page (e.g., for each page in isPartOf).
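
A minimal sketch of such a reference on one of the other pages could look like the following. The chunk's own URI, its name, and the CreativeWork type are assumptions for illustration; use whichever type suits the chunk best:

```html
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@id": "http://example.com/foobar/part-2#chunk",
  "@type": "CreativeWork",
  "name": "Foobar, part 2",
  "isPartOf": {"@id": "http://example.com/foobar#book"}
}
</script>
```

Writing the isPartOf value as an object with @id makes it explicit that the value is a reference to the book item and not plain text.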

unor
  • Hi, thank you for your reply. Just to check whether I have understood you correctly, please see my final update (based on your comment) at the bottom of my main post. There are also one or two questions in it that need clarification. Thanks again for your time and patience! – Mon Sep 06 '16 at 13:31
  • @Mon: As I commented earlier, your question was already too broad for Stack Overflow, and your edit made it even broader :) If you have additional questions, please create a separate question post (one main question per post). If you are just interested in a code review, you could ask on [codereview.se]. – unor Sep 06 '16 at 13:56
  • Okay. Let me totally narrow it down then. It can't be narrower and more specific than this, I think. :) Could you possibly just confirm point no. 2 in my update above in the main post? It's part of _your_ code - _is this what you meant_ - to be placed **like this** in the _rest_ of the pages except the main page? (I hope this question doesn't break the "one main question per post" rule either. It's not a _main_ question, nor can it be posed as a _main_ or _separate_ or _independent_ question, because it's totally related to and dependent on your previous reply :) ). – Mon Sep 07 '16 at 05:11
  • @Mon: When you want to say that it’s part of that book, yes. You might want to state what *it* is, i.e., provide a type for the chunk (like `"@type": "CreativeWork",`, or whichever type is suitable). – unor Sep 07 '16 at 05:36