I have a PDF file which I am processing by converting it into text using the following code:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

During processing, if I am seeing any type of ambiguity in the content, meaning an error in the data of the PDF file, I have to mark the entire line of the pdf file (color that line with red), but I cannot work out how to achieve that. Please help me.

Ria
Adi
  • What does "mark the entire line of the pdf file" mean? What does "seeing any type of ambiguity in the content of the PDF file" mean? If you explain those, this will be much easier to answer. – Joel Peltonen Feb 20 '14 at 08:13
  • What does "error in the data of the PDF file" mean? – Joel Peltonen Feb 20 '14 at 08:17
  • @Nenotlep Suppose the price should be 10.00 but it's 11.00; then this line should be marked red. – Adi Feb 20 '14 at 08:18
  • @Adi What you essentially need is a `SimpleTextExtractionStrategy` replacement which returns not just text but text with positions. The `LocationTextExtractionStrategy` would be a good starting point for that as it collects the text with positions (to put it in the right order). You merely have to properly retrieve the positions from that strategy. Hints for this have often been given on SO, for example cf. [this answer](http://stackoverflow.com/questions/13714605/retrieve-the-respective-coordinates-of-all-words-on-the-page-with-itextsharp/13719947#13719947). – mkl Feb 20 '14 at 10:16
  • @Adi Having extracted the coordinates, you can use a `PdfStamper` and add highlighting annotations at those coordinates. – mkl Feb 20 '14 at 10:18
  • @mkl I am not able to get the coordinates; actually, I am not able to use this. Please tell me how to use this in my case. Thank you. – Adi Feb 21 '14 at 06:38

2 Answers

Too long to be a comment; added as answer.

My good fellow and peer Adi, it depends a lot on your PDF contents. It's hard to build a generic solution for something like this. What does currentText contain? Can you give an example of it? Also, if you have a lot of these PDFs to check, you need to get currentText for a few of them, just to make sure that your current PDF-to-string conversion produces the same kind of result every time. If it is consistent across different PDFs, then you can start to automate.

The automation also depends a lot on your content. For example, if currentText is something like Value: 10\nValue: 11\nValue: 9\nValue: 15, then I recommend going through every line, extracting the value, and checking it against what you need it to be. This untested semi-pseudocode gives you an idea of what I mean:

// Requires: using System.Collections.Generic;
// Assumes "Value: 10" is the expected content of a correct line.
var lines = new List<string>(currentText.Split('\n'));
var newLines = new List<string>();
foreach (var line in lines) {
    if (line.Trim() == "Value: 10") { // Trim() also drops a trailing '\r'
        newLines.Add(line); // This line is correct, no marking needed
    } else {
        newLines.Add("THIS IS WRONG: " + line); // Mark as incorrect; use whatever you need here
    }
}

// Next, return newLines to the user, showing them which lines are bad so they can edit the PDF

If you need to automatically edit the existing PDF, this will be very, very, very hard. I think it's beyond the scope of my answer - I was answering how to identify the wrong lines and not how to mark them - sorry! Someone else please add that answer.

By the way; PDF is NOT a good format for doing something like this. If you have access to any other source of information, most likely the other one will be better.

Bruno Lowagie
Joel Peltonen
  • Sorry, I know this answer sucks but it's all I've got at the moment. Hopefully someone else will help you better. – Joel Peltonen Feb 20 '14 at 09:30
  • Thanks a lot, sir, for your kind support. The code that you posted is already done by me. I need to do the other part as explained. Actually, I have mobile bills in PDF format that I need to validate for their correctness. If they are correct, it's OK; otherwise I send them to the service provider with the wrongly charged lines marked for a particular call. – Adi Feb 20 '14 at 10:01
  • I'm happy to see that my request to restore this answer has been accepted. It's not because the question is unanswerable that an answer explaining why **PDF is NOT a good format for doing something like this** should be removed! – Bruno Lowagie Feb 21 '14 at 07:43
  • @BrunoLowagie I do think that PDF is not a good format for doing something like this, it would be nicer to do it upstream if there is an information source that is typed or easier to manipulate. I actually think that my answer should be deleted because it answers the wrong question; I thought the key part of the question was the algorithm and not the PDF editing. Derp me! – Joel Peltonen Feb 21 '14 at 10:05
  • It's a useful answer. If anything should be deleted, it should be the questions. See http://stackoverflow.com/questions/21930509/how-to-get-the-dimensions-of-the-specific-text-while-reading-pdf-using-itext and http://stackoverflow.com/questions/21927641/how-to-search-for-particular-line-contents-in-pdf-and-make-that-line-marked-in-c The behavior of Adi is getting offensive. He ignores all the good advice that is given, including your advice that stuff like this should be done upstream. – Bruno Lowagie Feb 21 '14 at 10:13

As already mentioned in comments: What you essentially need is a SimpleTextExtractionStrategy replacement which not only returns text but instead text with positions. The LocationTextExtractionStrategy would be a good starting point for that as it collects the text with positions (to put it in the right order).

If you look into the source of LocationTextExtractionStrategy you'll see that it keeps its text pieces in a member List<TextChunk> locationalResult. A TextChunk (inner class in LocationTextExtractionStrategy) represents a text piece (originally drawn by a single text drawing operation) with location information. In GetResultantText this list is sorted (top-to-bottom, left-to-right, all relative to the text base line) and reduced to a string.

What you need, is something like this LocationTextExtractionStrategy with the difference that you retrieve the (sorted) text pieces including their positions.

Unfortunately the locationalResult member is private. If it was at least protected, you could simply have derived your new strategy from LocationTextExtractionStrategy. Instead you now have to copy its source to add to it (or do some introspection/reflection magic).

Your addition would be a new method similar to GetResultantText. This method might recognize all the text on the same line (just like GetResultantText does) and either

  • do the analysis / search for ambiguities itself and return a list of the locations (start and end) of any found ambiguities; or

  • put the text found for the current line into a single TextChunk instance together with the effective start and end locations of that line and eventually return a List<TextChunk> each of which represents a text line; if you do this, the calling code would do the analysis to find ambiguities, and if it finds one, it has the start and end location of the line the ambiguity is on. Beware, TextChunk in the original strategy is protected but you need to make it public for this approach to work.
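
The second option can be sketched roughly as follows. This is only an illustration, not iTextSharp code: `PositionedText` and `LineGrouper` are hypothetical names standing in for your copy of `TextChunk` and the new method, the 1-point baseline tolerance is an arbitrary assumption, and joining pieces with a single space is a simplification of what `GetResultantText` does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned text piece; not an iTextSharp type.
public sealed class PositionedText
{
    public string Text;
    public float StartX, EndX, BaselineY;
}

public static class LineGrouper
{
    // Merges pieces whose baselines lie within `tolerance` points of the first
    // piece of a line into one PositionedText per line, top-to-bottom.
    public static List<PositionedText> GroupIntoLines(
        IEnumerable<PositionedText> chunks, float tolerance = 1f)
    {
        var groups = new List<List<PositionedText>>();
        // PDF y coordinates grow upwards, so a larger y is higher on the page.
        foreach (var chunk in chunks.OrderByDescending(c => c.BaselineY))
        {
            var current = groups.Count > 0 ? groups[groups.Count - 1] : null;
            if (current != null &&
                Math.Abs(current[0].BaselineY - chunk.BaselineY) <= tolerance)
                current.Add(chunk);
            else
                groups.Add(new List<PositionedText> { chunk });
        }

        // Within each line, order left-to-right and record the line's extent.
        return groups.Select(g =>
        {
            var ordered = g.OrderBy(c => c.StartX).ToList();
            return new PositionedText
            {
                Text = string.Join(" ", ordered.Select(c => c.Text)),
                StartX = ordered[0].StartX,
                EndX = ordered.Max(c => c.EndX),
                BaselineY = g[0].BaselineY
            };
        }).ToList();
    }
}
```

The calling code can then run its checks on each line's `Text` and, for a line that fails, use `StartX`, `EndX`, and `BaselineY` as the location to mark.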

Either way, you eventually have the start and end location of the ambiguities or at least of the lines the ambiguities are on. Now you have to highlight the line in question (as you say, you have to mark the entire line of the pdf(Color that line with Red)).

To manipulate a given PDF you use a PdfStamper. You can mark a line on a page by either

  • getting the UnderContent for that page from the PdfStamper and filling a rectangle in red there using your position data; the disadvantage of this approach is that if the original PDF has already underlaid the line with filled areas, your mark will be hidden thereunder; or by

  • getting the OverContent for that page from the PdfStamper and filling a somewhat transparent rectangle in red; or by

  • adding a highlight annotation to the page.
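
As a rough sketch of the OverContent variant, assuming iTextSharp 5.x: `LineMarker`, `MarkLine`, and the parameters `src`, `dest`, `page`, and `box` are hypothetical names for illustration, and the 0.3 opacity is an arbitrary choice. The rectangle is expected in default PDF user space coordinates, as produced by the text extraction.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class LineMarker
{
    // Fills `box` on `page` with translucent red and writes the result to `dest`.
    // All four parameters are hypothetical inputs supplied by the caller.
    public static void MarkLine(string src, string dest, int page, Rectangle box)
    {
        var reader = new PdfReader(src);
        using (var output = new FileStream(dest, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, output);
            PdfContentByte canvas = stamper.GetOverContent(page);

            canvas.SaveState();
            // Partial opacity keeps the original text readable under the mark.
            canvas.SetGState(new PdfGState { FillOpacity = 0.3f });
            canvas.SetColorFill(BaseColor.RED);
            canvas.Rectangle(box.Left, box.Bottom, box.Width, box.Height);
            canvas.Fill();
            canvas.RestoreState();

            stamper.Close(); // finishes writing the stamped PDF
        }
        reader.Close();
    }
}
```

To mark several lines, it is probably better to keep one `PdfStamper` open and draw all rectangles before closing it, rather than calling a method like this once per line.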

To make things even smoother, you might want to extend your copy of TextChunk (inner class in your copy of LocationTextExtractionStrategy) to keep not only the base line coordinates but also the maximal ascent and descent of the glyphs used. Obviously you'd have to fill in that information in RenderText...

Doing so you know exactly the height required for your marking rectangle.
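
A minimal sketch of capturing that information in RenderText, assuming iTextSharp 5.x and horizontal, unrotated text. Note that this is a standalone listener rather than the modified copy of LocationTextExtractionStrategy described above; `ChunkBounds` and `BoundsCollector` are hypothetical names.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical record type; not part of iTextSharp.
public sealed class ChunkBounds
{
    public string Text;
    public float StartX, EndX, BaselineY, Top, Bottom;
}

// Records each text piece together with its baseline, ascent, and descent.
public sealed class BoundsCollector : IRenderListener
{
    public readonly List<ChunkBounds> Chunks = new List<ChunkBounds>();

    public void RenderText(TextRenderInfo info)
    {
        LineSegment baseline = info.GetBaseline();
        Vector start = baseline.GetStartPoint();
        Vector end = baseline.GetEndPoint();
        Chunks.Add(new ChunkBounds
        {
            Text = info.GetText(),
            StartX = start[Vector.I1],
            EndX = end[Vector.I1],
            BaselineY = start[Vector.I2],
            // Ascent and descent lines give the vertical extent of the glyphs.
            Top = info.GetAscentLine().GetStartPoint()[Vector.I2],
            Bottom = info.GetDescentLine().GetStartPoint()[Vector.I2]
        });
    }

    // Not needed for this sketch.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo info) { }
}
```

It could be run with something like `new PdfReaderContentParser(pdfReader).ProcessContent(page, new BoundsCollector())`; the height of the marking rectangle for a line is then the largest `Top` minus the smallest `Bottom` among that line's pieces.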

mkl