PDF: Can a text chunk contain 2 or more words?

Question

I'm using LocationTextExtractionStrategy to extract text from a PDF. The text arrives in a function called RenderText. So my question is: can one chunk contain more than 2 words? For example, we have the text 'MKL is a helpfull person'. Can it be delivered in chunks like these (the most important chunk is bolded):

MK

L

is a h

elpfull

per son

?

Below is the code I use for word separation. I do the word separation while adding text (a chunk from the RenderText function) to the current line.

public class TextLineLocation
{
    public float X { get; set; }
    public float Y { get; set; }
    public float Height { get; set; }
    public float Width { get; set; }
    private string Text;
    // characters that end a word
    private static readonly char[] bannedSings = { ' ', ',', '.', '/', '|', '\\', ';', '(', ')', '*', '&', '^', '!', '?' };
    public List<WordLocation> wordList = new List<WordLocation>();

    public void AddText(TextInfo text)
    {
        string chunkText = text.textChunk.Text;
        Text += chunkText;
        if (chunkText.IndexOfAny(bannedSings) < 0)
        {
            // no separator in this chunk: update last word
            AppendToLastWord(text, chunkText);
            return;
        }
        // splitting on all separators at once copes with any number of words in one chunk
        string[] splittedText = chunkText.Split(bannedSings);
        for (int i = 0; i < splittedText.Length; i++)
        {
            string val = splittedText[i];
            // empty pieces come from leading, trailing or adjacent separators
            if (val.Length == 0)
            {
                continue;
            }
            if (i == 0)
            {
                // if it's the first element, add it to the current word
                AppendToLastWord(text, val);
            }
            else
            {
                // if it isn't the first element, create another word
                wordList.Add(new WordLocation(text.textChunk.StartLocation[1], text.textChunk.StartLocation[0], text.getFontWidth(), text.getFontHeight(), val));
            }
        }
    }

    private void AppendToLastWord(TextInfo text, string val)
    {
        if (wordList.Count == 0)
        {
            // nothing to extend yet, so this piece starts the first word
            wordList.Add(new WordLocation(text.textChunk.StartLocation[1], text.textChunk.StartLocation[0], text.getFontWidth(), text.getFontHeight(), val));
            return;
        }
        wordList[wordList.Count - 1].Text += val;
        wordList[wordList.Count - 1].Width += text.getFontWidth();
        wordList[wordList.Count - 1].Height += text.getFontHeight();
    }
}

I'm trying to extend the algorithm from [here](https://stackoverflow.com/questions/23909893/getting-coordinates-of-string-using-itextextractionstrategy-and-locationtextextr) to return word locations (X, Y, Width, Height) instead of whole lines. I already added Width and Height to the returned lines, but I'm wondering about chunks: what can they consist of? – Bartosz Olchowik, Jul 16 '18 at 18:51
Thanks for the compliment ;). As @dirkt answered, *you cannot rely on anything*. A chunk can contain anything from a single letter to a whole line (even across multiple columns). A chunk can even hold less than one visible character, e.g. a 'â' might be built from two chunks 'a' and '^'. One thing from your example is not likely to occur, though: if the word "person" comes as a single chunk, it is very unlikely that that chunk contains a space, as in 'per son'. – mkl, Jul 17 '18 at 06:31
OK, so I have to parse it wisely and hope that my method will work for most PDFs. I don't really care about national characters right now; my goal is to erase sensitive data from PDFs, for example document numbers, prices, first names and surnames. Of course some names can contain special characters, but I think this is not the time to solve problems like that. Thank you for your reply. – Bartosz Olchowik, Jul 17 '18 at 10:47

score 0 · Accepted Answer · answered Jul 16 '18 at 19:39

I'm not sure which library LocationTextExtractionStrategy comes from, or what exactly it does, but in the PDF representation itself you can group characters together in a "chunk".

How this is used depends entirely on the program that produces the PDF: some programs keep words together, some programs only group word fragments (for example for kerning), and some programs do other, random things.

So, if LocationTextExtractionStrategy returns these as chunks, you can't rely on anything. If LocationTextExtractionStrategy doesn't return these, but instead relies on spacing heuristics to group characters into chunks, then this will be as good as the heuristics are.
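To make "spacing heuristics" concrete, here is a minimal sketch of one such heuristic: treat the horizontal gap between two neighbouring chunks on the same line as a word break when it exceeds some fraction of a space width. The `Chunk` type, its fields, and the 0.5 factor are all assumptions made for illustration; they are not types or values taken from iTextSharp.

```csharp
using System;

// Hypothetical stand-in for an extracted chunk; not a library type.
public sealed class Chunk
{
    public string Text { get; }
    public float StartX { get; }      // left edge of the chunk
    public float EndX { get; }        // right edge of the chunk
    public float SpaceWidth { get; }  // width of a space in the chunk's font

    public Chunk(string text, float startX, float endX, float spaceWidth)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        SpaceWidth = spaceWidth;
    }
}

public static class GapHeuristic
{
    // Returns true when the gap between two chunks on the same line is wide
    // enough to look like a word break. The factor is an assumed threshold.
    public static bool IsWordBreak(Chunk previous, Chunk next, float factor = 0.5f)
    {
        float gap = next.StartX - previous.EndX;
        return gap > previous.SpaceWidth * factor;
    }
}
```

A tight gap (kerning between word fragments) stays below the threshold, while a gap close to a full space width exceeds it. Any fixed threshold will misjudge some documents, which is exactly why the result is only as good as the heuristic.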

Bottom line: A PDF doesn't contain text; it contains glyphs and their positions on the page. Trying to reconstruct text from it is and remains guesswork. You may get it to work in the majority of cases, but there will always be PDFs where whatever you are doing fails.
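One way to avoid depending on chunk boundaries at all is to ignore them: feed the characters of every chunk through a single word builder and start a new word only at a separator. The sketch below does that on plain strings. `WordAssembler` and its methods are hypothetical names, the separator list is copied from the question, and the sketch assumes that every word break shows up as a separator character inside some chunk (a break that exists only as a positional gap would have to be turned into a separator first).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WordAssembler
{
    // Separator characters taken from the question's list.
    private static readonly char[] Separators =
        { ' ', ',', '.', '/', '|', '\\', ';', '(', ')', '*', '&', '^', '!', '?' };

    // Rebuilds words from chunks whose boundaries are arbitrary:
    // a word may span several chunks and a chunk may hold several words.
    public static List<string> Assemble(IEnumerable<string> chunks)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (string chunk in chunks)
        {
            foreach (char c in chunk)
            {
                if (Array.IndexOf(Separators, c) >= 0)
                {
                    Flush(words, current);   // a separator ends the current word
                }
                else
                {
                    current.Append(c);       // anything else extends it
                }
            }
        }
        Flush(words, current);               // the last word has no trailing separator
        return words;
    }

    // Moves the pending word, if any, into the result list.
    private static void Flush(List<string> words, StringBuilder current)
    {
        if (current.Length > 0)
        {
            words.Add(current.ToString());
            current.Clear();
        }
    }
}
```

For the chunks "MK", "L ", "is a h", "elpfull ", "person" this yields the words MKL, is, a, helpfull, person, regardless of where the chunk boundaries fall. It does not track coordinates; that would have to be added per character or per chunk.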

dirkt


Thank you for the reply. The LocationTextExtractionStrategy class comes from iTextSharp.text.pdf.parser. – Bartosz Olchowik Jul 16 '18 at 19:50
Are you sure about your last sentence, that "A PDF doesn't contain text"? Can I read about it somewhere in the documentation? – Bartosz Olchowik Jul 17 '18 at 15:25
Yes. PDFs contain a restricted form of PostScript embedded in an object tree. You can associate glyphs with characters (and reconstruct the text), *if* the PDF contains the tables for that, but then you still don't know where words begin and end. The standard is available e.g. [here](https://www.adobe.com/devnet/pdf/pdf_reference.html). You can use tools like `mutool` from `mupdf` to decompress the streams; then you can open the PDF file in a text editor and see for yourself. – dirkt Jul 18 '18 at 05:54
