How to detect Font Family in PDF?

Question

I have a pdf which contains many fonts and What is the best way to check whether it contains font that belongs to Arial font family ? Is this possible in any language?
I couldn't find any library or language that could do this.

So, I tried by converting pdf to image using ImageMagick and segmented all alphabets present in the image(pdf).And then I tried to compare all segmented alphabets to segmented images of arial font family alphabets which worked fine.

I created all datasets using MS Word.But arial font family looks different in different editors.By 'looks different', I mean the segmented image of same alphabet has different pixel values in different editors.And also alphabet of 10pt size has different dimensions in different editors.So, this method doesn't work.

Any Suggestions on how to do this? May be using svg file or ps file

I also came to learn that, In pdf's alphabets are rendered using Bezier curves, where each bezier curve is drawn using some control points and nodes. Are these control points, same for all alphabets that belong to one font family? If Yes, how to extract control points of alphabets in pdf as these can be used to detect font family.

Which code shall be able to check. C, Python, grep, strings, ...? Please add to the question and what you tried so far. Thanks. — Dilettant, Jun 24 '16 at 10:46
Questions that ask "where do I start" are typically too broad and [are not a good fit for this site](http://stackoverflow.com/help/on-topic). People have their own method for approaching the problem and because of this there cannot be a _correct_ answer. Give a good read over [**Where to Start**](//meta.programmers.stackexchange.com/questions/6366/where-to-start/6367#6367), then address your post. — Claudia, Jun 24 '16 at 10:52

score 1 · Answer 1 · edited May 23 '17 at 12:09

There can be three types of text in your document:

text that isn't real text, but part of a raster image,
vector text drawn by PDF syntax without using a real font,
vector text using a real font.

The answer to your question depends on the type of text you're confronted with:

There is no way to extract font information if the text is not real text, but part of a raster image. You need an OCR tool to convert the pixels into characters, but you won't get any info about the font family. You could try to compare pixels, but you've already tried that and you've discovered that this isn't trivial (one might consider your current solution as a bad workaround / bad design).
You describe text that is drawn on a page using Bézier curves. Although, it's possible to draw text like this, you won't find many PDFs that are drawn like this. The reason is obvious: every time you'd need a specific glyph, let's say A, you would need to add the syntax to draw that glyph on the page, leading to plenty of redundant PDF syntax.
PDFs usually work with fonts. A font is stored in a PDF file using a font dictionary. The syntax that makes up a page refers to that font using a name that can be chosen by the PDF producer, but that corresponds with an entry in the page resources that contains a reference to the font dictionary. Each font has an encoding mapping characters to glyphs. In the page content we use characters, based on these characters, glyphs will be selected in the font.

You are asking about the font family. This information is stored in the font dictionary. Take a look at my answer to the question What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp and you'll get an idea of what such a font dictionary looks like.

Do you see the /BaseFont entry in the font dictionary? It has values such as JOJJAH+TT116t00. In this case, the name of the font is "TT116t00", but what is "JOJJAH"? That is explained in my answer to the question What are the extra characters in the font name of my PDF?

Not all fonts are embedded. Sometimes the name of the font is sufficient for the viewer to know what the glyphs look like. For instance: there are 14 Standard Type 1 fonts that every viewer should be able to render.

Arial isn't one of those fonts, so if you want to be sure that Arial is rendered correctly, that font needs to be embedded. The font dictionary will refer to a Font Descriptor where you'll find the syntax to draw glyphs using linear paths, Bézier curves, etc. Suppose that you need the character A, then the font descriptor will contain some syntax that knows how to draw that character. The font dictionary will also have a map that maps the character A to the glyph A. Now when you need that glyph in your content, you can just use the character A and that will refer to the syntax that draws the glyph A. That syntax is stored inside the PDF only once.

Suppose that a PDF has the full Arial font embedded, then the value of /BaseFont would be Arial. However, if we'd embed the full Arial font, the PDF would be bloated. There are way too many characters in Arial; we don't need them all. That's why we'll only embed one or more subsets. When you see 6 characters followed by a + sign in the /BaseFont entry, you have discovered a font subset.

Getting the /BaseFont entry of a font dictionary can be done using different libraries. On the official iText site, we have different Q&As that explain how to Inspect a PDF. There's also an example that lists the fonts used in a PDF. Maybe that can be helpful.

NOTE: as explained in the help section, more specifically on the page What topics can I ask about here?, you will find rule #4: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam.

I have given you the general information about where to find font information inside a PDF, but it's not allowed for you to ask questions to recommend the best tool to do this. Sorry for that.

How to detect Font Family in PDF?

1 Answers1