In my program I extract text from a PDF file, and it works well. iTextSharp extracts text from a PDF line by line. However, when a PDF file contains two columns, the extracted text is wrong because each extracted line joins the text of both columns.

My problem is: How can I extract text column by column?

Below is my code. The PDF files are in Arabic. I'm sorry, my English is not so good.

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string text;      // declared here because it is assigned inside the loop below
string[] words;
string line;

for (int i = 1; i <= intPageNum; i++)
{
    text = PdfTextExtractor.GetTextFromPage(reader, i, 
               new LocationTextExtractionStrategy());

    words = text.Split('\n');
    for (int j = 0, len = words.Length; j < len; j++)
    {
        line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
        // other things here
    }

    // other things here
}
nevets
mansureh
  • A complete code sample and link to the PDF might help get better responses. – Kyle Aug 26 '14 at 05:27
  • Have you tried the simple text extraction strategy instead? If the text drawing operations in the content stream are in reading order, using it might be the better choice. – mkl Aug 26 '14 at 11:14
  • Make sure you read [this](http://stackoverflow.com/a/7515625/231316). Also, don't do the `GetString` and `GetBytes` thing, see [this](http://stackoverflow.com/a/10191879/231316) – Chris Haas Aug 26 '14 at 12:59
  • I tried the simple text extraction strategy. It is OK for 2 columns, but for 1 column it doesn't work well. My PDF files are right-to-left, and simple text extraction reads them left to right. – mansureh Aug 27 '14 at 03:51
  • @ChrisHaas Is this solution suitable for RightToLeft languages? In linked page Mark said:"The above code ass-u-mes that the text is horizontal and proceeds from left to right. Rotated text will screw it up, as will vertical text or right-to-left (Arabic, Hebrew) text". – mansureh Aug 27 '14 at 04:02
  • I posted that to make sure you understand _you don't actually have columns_. You don't even have lines of text. How a PDF looks has nothing to do with how the PDF is written. What you call "two columns" could actually be one "line" of two with a giant space in the middle. What you call a line in "one column" could actually be 20 unrelated lines in code. The text extraction strategies can help with the latter but not the former. The code Mark posted should work with RTL but it might come out backwards. You'll just have to try it. – Chris Haas Aug 27 '14 at 13:00
  • check this question that may help you http://stackoverflow.com/questions/16080741/convert-arabicunicode-content-html-or-xml-to-pdf-using-itextsharp – Mohamed Salah Jul 27 '15 at 14:45

1 Answer

You may want to use RegionTextRenderFilter to restrict extraction to a column region, then use LocationTextExtractionStrategy to extract the text. However, this requires prior knowledge of the PDF file you are parsing, i.e. you need to know each column's position and size.

In more detail, you pass in the coordinates of your column (lower-left and upper-right corners) to define a rectangle, then extract the text from that rectangle. A sample looks like this:

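// requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
// reader is assumed to be a field of the class that contains GetColumnText
PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;

// Returns the text inside the given rectangle on one page (pageNum is 1-based).
private string GetColumnText(int pageNum, float llx, float lly, float urx, float ury)
{
    // reminder, parameters are in points, and 1 in = 2.54 cm = 72 points
    var rect = new iTextSharp.text.Rectangle(llx, lly, urx, ury);

    // only text that intersects rect is passed on to the extraction strategy
    var renderFilter = new RenderFilter[1];
    renderFilter[0] = new RegionTextRenderFilter(rect);

    var textExtractionStrategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
                                           renderFilter);

    // extract from the requested page, not from page number intPageNum (the last page)
    var text = PdfTextExtractor.GetTextFromPage(reader, pageNum,
                                                textExtractionStrategy);

    return text;
}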
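
If the column positions are not known in advance, one rough starting point is to split each page at its horizontal midpoint. The sketch below is only an illustration under that assumption (exactly two equal-width columns, unrotated pages); the class and method names `TwoColumnExtractor`, `ExtractRegion` and `ExtractPage` are made up for this example. For right-to-left text such as Arabic, the right-hand column is read first.

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class TwoColumnExtractor
{
    // Hypothetical helper: extracts the text that intersects one rectangle of a page.
    private static string ExtractRegion(PdfReader reader, int page, Rectangle region)
    {
        var strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(),
            new RegionTextRenderFilter(region));
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }

    // ASSUMPTION: the page has exactly two columns that meet at the horizontal
    // midpoint of the page. Real layouts may need measured coordinates instead.
    public static string ExtractPage(PdfReader reader, int page, bool rightToLeft)
    {
        Rectangle size = reader.GetPageSize(page);   // page is 1-based
        float mid = (size.Left + size.Right) / 2f;

        var left = new Rectangle(size.Left, size.Bottom, mid, size.Top);
        var right = new Rectangle(mid, size.Bottom, size.Right, size.Top);

        // Read the right-hand column first for right-to-left scripts.
        string first = ExtractRegion(reader, page, rightToLeft ? right : left);
        string second = ExtractRegion(reader, page, rightToLeft ? left : right);
        return first + Environment.NewLine + second;
    }
}
```

Calling `TwoColumnExtractor.ExtractPage(reader, i, true)` inside the page loop would then return the two halves one after the other. Note that a text chunk crossing the midline (a full-width heading, for example) intersects both rectangles and may show up in both halves.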

Here is another post discussing the same problem, which you may want to check as well: iTextSharp - Reading PDF with 2 columns. It did not arrive at a complete solution either :(

nevets
  • Thanks, I understand it, but how can I find the llx, lly values? And what about a 1-column PDF? How can the program count the number of columns? – mansureh Aug 26 '14 at 05:53
  • You pointed out the worst thing: you need prior knowledge about the PDF, namely the column positions and sizes :( – nevets Aug 26 '14 at 06:12
  • What I suggest is to add a GUI that allows the user to specify the number of columns and the column regions. – nevets Aug 26 '14 at 06:23
  • Oh, and most of the parameters in iTextSharp are in points: 1 inch = 2.54 cm = 72 points – nevets Aug 26 '14 at 07:25
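
On the question of how a program could guess the column boundary by itself: one possible heuristic, not something iTextSharp provides, is to collect the horizontal extent of every text chunk on a page and look for a wide vertical strip that no chunk covers. The sketch below shows only that gap search in plain C# (C# 7 tuples). `GutterFinder`, `FindGutter` and `minGap` are names invented for this illustration, and the spans are assumed to come from your own render listener, for instance from the start and end points of each chunk's baseline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GutterFinder
{
    // Each span is the horizontal extent (in points) of one text chunk on the page.
    // Returns the x-coordinate at the center of the widest uncovered gap that is
    // at least minGap points wide, or null when no such gap exists (one column).
    public static float? FindGutter(IEnumerable<(float Start, float End)> spans, float minGap)
    {
        var sorted = spans.OrderBy(s => s.Start).ToList();
        if (sorted.Count == 0) return null;

        float? best = null;
        float bestWidth = minGap;          // a gap must be at least this wide to count
        float coveredTo = sorted[0].End;   // right edge of everything seen so far

        foreach (var s in sorted.Skip(1))
        {
            float gap = s.Start - coveredTo;
            if (gap >= bestWidth)
            {
                bestWidth = gap;
                best = coveredTo + gap / 2f;
            }
            coveredTo = Math.Max(coveredTo, s.End);
        }
        return best;
    }
}
```

For spans (50, 280), (60, 270) and (320, 550) with `minGap` of 20, the only uncovered strip runs from 280 to 320, so the method returns 300, which could serve as the shared edge of the two column rectangles. A heading that spans the full page width covers the gap and defeats the heuristic, so such chunks would have to be filtered out first.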