
I am using the method below to extract PDF text line by line. The problem is that it does not pick up the spaces between words and figures. What could be the solution?

I just want to create a list of strings in which each string holds one text line from the PDF exactly as it appears there, including spaces.

public void readtextlinebyline(string filename)
{
    List<string> strlist = new List<string>();
    PdfReader reader = new PdfReader(filename);
    string text = string.Empty;
    for (int page = 1; page <= 1; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy()) + " ";
    }
    reader.Close();
    string[] words = text.Split('\n');
    foreach (string word in words)
    {
        strlist.Add(word);
    }

    foreach (string st in strlist)
    {
        Response.Write(st + "<br/>");
    }
}

I have also tried this method with the strategy changed to SimpleTextExtractionStrategy, but that does not work for me either.

NetStarter
shailendra
  • This [answer to "itext java pdf to text creation"](http://stackoverflow.com/questions/13644419/itext-java-pdf-to-text-creation/13645183#13645183) may illustrate the reason and hint at a solution: Copy the text extraction strategy and tweak the internal parameters, in your case the minimum width of a gap to be recognized as a space, `renderInfo.getSingleSpaceWidth()/2f` by default; the person who asked back there got improved results with `renderInfo.getSingleSpaceWidth()/4f`. – mkl May 06 '13 at 13:25
  • @Pengu As you offer a bounty, you surely are subject to this problem. Thus, you surely can supply one or more sample PDFs to serve as test cases for proposed solutions. The current state of the question makes answering pure guesswork. – mkl Nov 11 '13 at 16:19
  • @mkl I'm sorry for the late response, my connection broke. What I dislike is not your solution (it works); what bothers me is that it is probably not reliable. For example, it works with one file, but on another file it might produce too many spaces (because that document needs renderInfo.getSingleSpaceWidth()/2f or a totally different divisor). I don't have an example of that, but I can imagine it happening. So I asked for answers from a "more" reliable source. – BudBrot Nov 18 '13 at 07:53
  • @Pengu Unfortunately you won't easily get a generic 100% reliable solution. Some problems making it hard to get it are mentioned in the answer I pointed to. It can be really hard to differentiate between kerning and closely set words. – mkl Nov 18 '13 at 11:50
  • @mkl Yep, I thought something like that. Sad, but it can't be changed. I also tried many things, like calculating the space size based on the font, but nothing works as well as the solution you already posted. If you post your solution again as an answer, I can give you the reputation. – BudBrot Nov 18 '13 at 12:18
  • @Pengu Ok, I did so, adding some more backgrounds, and while doing so stumbled on an iTextSharp deficiency... oh well. ;) – mkl Nov 18 '13 at 14:22
  • I got fairly good results just using text = text.Replace("\n", "\r\n"); – Jack Griffin Mar 05 '21 at 17:48

3 Answers


The background on why spaces between words are sometimes not properly recognized by iText(Sharp) or other PDF text extractors has been explained in this answer to "itext java pdf to text creation": These 'spaces' are not necessarily created using a space character but instead using an operation creating a small gap. These operations are also used for other purposes (which do not break words), though, and so a text extractor must use heuristics to decide whether such a gap is a word break or not...

In particular, this means that you will never get 100% reliable word break detection.

What you can do, though, is to improve the heuristics used.

The standard text extraction strategies of iText and iTextSharp, for example, assume a word break in a line if

a) there is a space character or

b) there is a gap at least as wide as half a space character.

Item a is a sure hit, but item b often fails for densely set text. The OP of the question referenced above got quite good results using a fourth of the width of a space character instead.
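To make the two criteria concrete, here is a minimal, self-contained sketch of the gap heuristic. It is not iTextSharp code: the Chunk class, the JoinLine method and the divisor parameter are hypothetical names made up for illustration, and the chunks are assumed to be already sorted left to right on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text on one line
// (not an iTextSharp type).
public sealed class Chunk
{
    public readonly string Text;
    public readonly float StartX;
    public readonly float EndX;

    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapHeuristic
{
    // Joins chunks that are assumed to be sorted left to right on one line.
    // A space is inserted when no space character is already present at the
    // boundary (criterion a) and the gap is wider than spaceWidth / divisor
    // (criterion b). divisor = 2 mirrors the default described above,
    // divisor = 4 is the more sensitive variant.
    public static string JoinLine(IList<Chunk> chunks, float spaceWidth, float divisor)
    {
        var sb = new StringBuilder();
        Chunk previous = null;
        foreach (Chunk chunk in chunks)
        {
            if (previous != null)
            {
                float gap = chunk.StartX - previous.EndX;
                bool hasSpaceChar =
                    previous.Text.EndsWith(" ", StringComparison.Ordinal) ||
                    chunk.Text.StartsWith(" ", StringComparison.Ordinal);
                if (!hasSpaceChar && gap > spaceWidth / divisor)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(chunk.Text);
            previous = chunk;
        }
        return sb.ToString();
    }
}
```

With a space width of 5 and a gap of 1.5 between two chunks, a divisor of 2 (threshold 2.5) glues the words together, while a divisor of 4 (threshold 1.25) separates them. The same change that helps here can of course split kerned words in another document.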

You can tweak these criteria by copying and changing the text extraction strategy of your choice.

In the SimpleTextExtractionStrategy you find this criterion embedded in the RenderText method:

if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}

In case of the LocationTextExtractionStrategy this criterion meanwhile has been put into a method of its own:

/**
 * Determines if a space character should be inserted between a previous chunk and the current chunk.
 * This method is exposed as a callback so subclasses can fine tune the algorithm for determining whether a space should be inserted or not.
 * By default, this method will insert a space if the there is a gap of more than half the font space character width between the end of the
 * previous chunk and the beginning of the current chunk.  It will also indicate that a space is needed if the starting point of the new chunk 
 * appears *before* the end of the previous chunk (i.e. overlapping text).
 * @param chunk the new chunk being evaluated
 * @param previousChunk the chunk that appeared immediately before the current chunk
 * @return true if the two chunks represent different words (i.e. should have a space between them).  False otherwise.
 */
protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if(dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;
    return false;
}

The intention for putting this into a method of its own was to merely require simple subclassing of the strategy and overriding this method to adjust the heuristics criteria. This works fine in case of the equivalent iText Java class but during the port to iTextSharp unfortunately no virtual has been added to the declaration (as of version 5.4.4). Thus, currently copying the whole strategy is still necessary for iTextSharp.

@Bruno You might want to tell the iText -> iTextSharp porting team about this.

While you can fine-tune text extraction at these code locations, you should be aware that you will not find a 100% reliable criterion here. Some reasons are:

  • Gaps between words in densely set text can be smaller than kerning or other gaps for some optical effect inside words. Thus, there is no one-size-fits-all factor here.
  • In PDFs not using the space character at all (as you can always use gaps, this is possible), the "width of a space character" might be some random value or not determinable at all!
  • There are funny PDFs abusing the space character width (which can individually be stretched at any time for the operations to follow) to do some tabular formatting while using gaps for word breaking. In such a PDF the value of the current width of a space character cannot seriously be used to determine word breaks.
  • Sometimes you find s i n g l e words in a line printed spaced out for emphasis. These will likely be parsed as a collection of one-letter words by most heuristics.

You can get better than the iText heuristics and those derived from it using other constants by taking into account the actual visual free space between all characters (using PDF rendering or font information analysis mechanisms), but for a perceivable improvement you have to invest much time.

mkl
  • Excellent writeup. @mkl, you may want to open an issue in the iText bug tracker about the iTextSharp port (not sure if Bruno will see this or not). – Kevin Day Feb 25 '14 at 04:31
  • As far as I know, the iTextSharp port has meanwhile added `virtual` to this `LocationTextExtractionStrategy` method. Actually not merely this method but virtually every `public` method. – mkl Feb 25 '14 at 05:16
  • Brilliant answer. Exactly the information I needed and written up very completely and clearly. Thank you so much. – Jansky Mar 07 '16 at 11:13
0

I have my own implementation, and it works very well.

    /// <summary>
    /// Read a PDF file and returns the string content.
    /// </summary>
    /// <param name="par">ByteArray, MemoryStream or URI</param>
    /// <returns>FileContent.</returns>
    public static string ReadPdfFile(object par)
    {
        if (par == null) throw new ArgumentNullException("par");

        PdfReader pdfReader = null;
        var text = new StringBuilder();

        if (par is MemoryStream)
            pdfReader = new PdfReader((MemoryStream)par);
        else if (par is byte[])
            pdfReader = new PdfReader((byte[])par);
        else if (par is Uri)
            pdfReader = new PdfReader((Uri)par);

        if (pdfReader == null)
            throw new InvalidOperationException("Unable to read the file.");

        try
        {
            for (var page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from earlier pages is not repeated.
                var strategy = new SimpleTextExtractionStrategy();
                // GetTextFromPage already returns a Unicode string. Round-tripping it
                // through Encoding.Default would turn characters outside the ANSI
                // code page into '?', so no re-encoding is done here.
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            // Release the reader even if extraction throws.
            pdfReader.Close();
        }

        return text.ToString();
    }
  • You use the standard iText(Sharp) text extraction mechanism with the `SimpleTextExtractionStrategy` while the OP used it with the `LocationTextExtractionStrategy`. While this certainly is a difference, they use essentially the same heuristics to determine a word break (a space character or a gap at least half as wide as a space character). Thus, this will hardly do any better than the original code. – mkl Nov 18 '13 at 13:16
  • mkl is right. This may work in some cases but will fail in others, like mine. (I got unrecognizable characters due to a different encoding.) Also, the SimpleTextExtractionStrategy does not insert '\n' properly in my case, so I had to build my own custom RenderListener (as I need to extract images as well) and tweak the code to meet my requirements, e.g. change the condition that detects a new line from orientationMagnitude == other.OrientationMagnitude to Math.Abs(orientationMagnitude - other.OrientationMagnitude) < 10. Obviously it won't work in all cases. – Silent Sojourner Jul 06 '17 at 21:12
0
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder textfinal = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Extract each page only once, then re-emit its lines
        // with the platform's line terminator.
        string page = PdfTextExtractor.GetTextFromPage(reader, i);
        foreach (string line in page.Split('\n'))
        {
            textfinal.Append(line);
            textfinal.Append(Environment.NewLine);
        }
    }
    // textfinal now holds the extracted text, one extracted line per row.
}