I am trying to scrape this webpage which uses a custom font to present text in Sanskrit. I have the ttf file for the font used in the webpage.

Is there any way I could scrape this website using the TTF file and convert the content to Unicode using Python (or, failing that, any other language)?

The font they are using is iitmsans.ttf from http://www.acharya.gen.in:8080/fonts/iitmfonts.php

tripleee
pavan
  • The TTF file doesn't reveal which Unicode code points these glyphs map to. You will need to create a table which contains a mapping from each character code to its corresponding Unicode code point (also known as an encoding). – tripleee Oct 31 '18 at 02:48
  • For what it's worth, the page renders simply as ½£ ÂO¤âÔOOd[kOÛOÓdÓOØO¯ etc for me. – tripleee Oct 31 '18 at 02:49
  • Is this supposed to represent Devanagari? Any chance that it's actually using (something like) [ISCII](https://en.wikipedia.org/wiki/Devanagari#ISCII) or some other existing encoding? – tripleee Oct 31 '18 at 02:52
  • The text is devanagari. It is using iitmsans fonts. – pavan Nov 01 '18 at 13:00
  • http://www.acharya.gen.in:8080/fonts/iitmfonts.php says they are using a custom encoding which is supposed to be compatible with ISO-8859-1. They are a non-profit so ideally they would have a public document with information about the encoding. Otherwise you will have to reconstruct the encoding by hand. I repeat, having the font doesn't help at all (except you can render a map to help you see what the glyphs look like so you can try to find them in a Unicode chart). – tripleee Nov 01 '18 at 13:03
  • I ran [`ttx`](https://github.com/fonttools/fonttools) on the font and it seems to contain the supremely useless ISO-8859-1 names of the glyphs instead of any indication of what they represent in their custom encoding. I can't read Devanagari so I'm probably not particularly competent to identify the glyphs but if you can generate a simple table it's probably approx 30 minutes of work for someone who is familiar with the script (or say 60 for a slow typist, to avoid overoptimistic estimates). For tangential inspiration, maybe see also https://cdn.rawgit.com/tripleee/8bit/master/encodings.html – tripleee Nov 01 '18 at 13:19
  • Thanks a lot @tripleee for the information. But decoding with Wikipedia's ISCII table produces gibberish. Is there any way I can use the glyphs shown by http://bluejamesbond.github.io/CharacterMap/ to convert the text to Unicode? – pavan Nov 01 '18 at 13:40
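
The "simple table" suggested in the comments could start from a generated skeleton, one row per character code, with the Unicode column left blank for someone who reads Devanagari to fill in. This is only a sketch: the function name `mapping_skeleton`, the column layout, and the default code range are assumptions, not anything the site or the font prescribes.

```python
def mapping_skeleton(start=0x20, end=0xFF):
    """Return tab-separated rows: hex code, Latin-1 character, empty Unicode column."""
    lines = ["code\tlatin1\tunicode"]
    for code in range(start, end + 1):
        # The page's character codes are single bytes, shown here as Latin-1.
        ch = bytes([code]).decode("latin-1")
        if not ch.isprintable():
            ch = ""  # leave control characters and similar blank
        lines.append(f"0x{code:02X}\t{ch}\t")
    return "\n".join(lines)


# Print a three-row sample; the last column is filled in by hand.
print(mapping_skeleton(0x41, 0x43))
```

Opening the result next to a rendering of the font lets you match each code to the glyph it displays and type the corresponding Unicode character into the last column.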

1 Answer

No, you probably have to do a bit of manual work to create an encoding for Python. The TTF file doesn't contain information about the Unicode mappings (it could but it's uncommon, and this one doesn't).

Looking at the font in http://bluejamesbond.github.io/CharacterMap/, I see many Devanagari glyphs, but I don't know their names or what variations are common or permitted in drawing them, so I probably can't easily find the same glyphs in Unicode for you. But I recognize the "om" glyph (U+0950) at character code 65 (0x41), so I can contribute the first item in your encoding:

{
 # ...
 0x41: '\u0950',  # character code 65 -> DEVANAGARI OM
 # ...
}

Do this for all the other glyphs in the font and you have a mapping you can use in Python. The general guidance is in the documentation for the standard `codecs` module, but you probably also want to look at examples such as "Custom Python Charmap Codec".
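
As a minimal sketch of how such a mapping might be applied once the scraped text is already a Python string, `str.translate` accepts a dictionary keyed by character code. The names `IITM_TO_UNICODE` and `iitm_to_unicode` are hypothetical, and the table holds only the single entry identified above; every other entry still has to be filled in by hand.

```python
# Hypothetical, incomplete table: only 0x41 -> U+0950 is known from the answer.
IITM_TO_UNICODE = {
    0x41: "\u0950",  # DEVANAGARI OM
}


def iitm_to_unicode(text: str) -> str:
    """Replace every character that has a table entry; leave the rest untouched."""
    return text.translate(IITM_TO_UNICODE)


print(iitm_to_unicode("A"))
```

Characters without an entry pass through unchanged, which makes gaps in the table easy to spot in the output while the mapping is still being built.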

[Image: screen dump of the OM glyph]

tripleee
  • Thanks a lot @tripleee for your help. I have one last doubt. How do I convert the text to character code so that I can convert that to Unicode using my mapper? – pavan Nov 01 '18 at 14:09
  • The text on the page *is* characters, you just need (to construct) a codec to `decode` them to Unicode. – tripleee Nov 01 '18 at 14:53
  • If you can provide a text table with the mappings, we can probably figure out the rest between us. If you are on Github, a gist with the first few would be a good start. – tripleee Nov 01 '18 at 16:07
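
To illustrate the `decode` step mentioned in the comments, here is a hedged sketch of a decode-only codec registered through the standard `codecs` module. The codec name `iitmsans` is made up for this example, the space and newline entries are assumptions, and only the 0x41 entry comes from the answer. The input is assumed to be the raw bytes of the page, for example the body of an HTTP response.

```python
import codecs

CODEC_NAME = "iitmsans"  # hypothetical name, not a codec Python ships with

# Incomplete table: byte value -> Unicode text.
DECODING_MAP = {
    0x0A: "\n",      # assumption: newline is unchanged
    0x20: " ",       # assumption: space is unchanged
    0x41: "\u0950",  # DEVANAGARI OM, from the answer above
}


def _decode(data, errors="strict"):
    # charmap_decode returns a (text, bytes_consumed) tuple, as codecs expect.
    return codecs.charmap_decode(data, errors, DECODING_MAP)


def _encode(text, errors="strict"):
    raise NotImplementedError("this sketch only decodes")


def _search(name):
    # Called by the codec registry with the normalized codec name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)

print(b"A A".decode(CODEC_NAME))
```

With the default `errors="strict"`, a byte that is missing from the table raises `UnicodeDecodeError`; passing `"replace"` instead substitutes U+FFFD, which can be handy while the table is still incomplete.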