
I'm trying to read text from a PDF file using iTextSharp with the following code, and I assign the result to a multiline TextBox (Windows desktop app).

Note: This code works fine.

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

But my PDF file contains an equation:

[image: the equation in the PDF]

and all I'm getting is the following output:

[image: the extracted text]

What could be added here to get the equation out as text? Any help would be really appreciated!

Aimal Khan
  • I upvoted this question because I find it interesting, but I think this is going to be really, really hard. How was the pdf created in the first place? Can you share it? – Amedee Van Gasse Aug 06 '16 at 15:12
  • What sort of output are you hoping for? Your math expression cannot be expressed in the Basic Multilingual Plane. – Jongware Aug 06 '16 at 21:54
  • @amedeevangasse Well it is quite simple. Check out the latex software! You need to activate the math mode for it, enter equations and it gives you output in pdf format. – Aimal Khan Aug 08 '16 at 05:30
  • I already guessed it was LaTeX, but does it put enough information into a pdf to be able to do the reverse operation? Doesn't look like it... – Amedee Van Gasse Aug 08 '16 at 05:38
  • What Rad Lexus wrote. Please write the math expression that you were expecting... – Amedee Van Gasse Aug 08 '16 at 05:39
  • Not directly related but completely remove the line `currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));` It doesn't do what you think it does and will eventually break things. See [this for more](http://stackoverflow.com/a/10191879/231316). – Chris Haas Aug 08 '16 at 12:36
  • So are you looking for a LaTeX file to be generated then? The reason everyone keeps asking you what the text should be is that we want to see in straight Unicode what exactly you expect – Chris Haas Aug 08 '16 at 12:38
  • May I suggest that if there's literally only one or two equations that you enter them by hand? Unless you have to repeat this operation many times it may be more efficient time-wise than what you're asking for. – jamesh625 Aug 11 '16 at 21:34
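
The comment about the encoding line can be illustrated with a small, library-free sketch. It reproduces the same conversion chain, but takes the legacy encoding as a parameter rather than using `Encoding.Default`. The class and method names are hypothetical, and Latin-1 is only an assumed stand-in for a Western ANSI code page (on newer .NET runtimes `Encoding.Default` is UTF-8 and the round trip changes nothing). The point is that the first step, `GetBytes`, already discards every character the legacy code page cannot represent, so math symbols cannot survive it.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same chain as in the question: string -> legacy bytes -> UTF-8 bytes -> string.
    // The legacy encoding is passed in explicitly (an assumption made for this sketch).
    public static string RoundTrip(string input, Encoding legacy)
    {
        // Characters the legacy code page cannot represent are replaced here.
        byte[] legacyBytes = legacy.GetBytes(input);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Assumption: Latin-1 stands in for a typical Western ANSI code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.OutputEncoding = Encoding.UTF8;

        string plain = "plain ASCII text";
        string math = "\u2211 x \u2264 \u03C0"; // summation sign, less-or-equal, pi

        Console.WriteLine(RoundTrip(plain, latin1) == plain); // text inside the code page survives
        Console.WriteLine(RoundTrip(math, latin1) == math);   // math symbols do not
    }
}
```

Dropping the conversion and appending the string returned by `GetTextFromPage` directly avoids this loss, since .NET strings are already Unicode.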

1 Answer


I have used iTextSharp, and I am 100% sure this is not possible. The problem lies in the PDF format itself: it does not contain any tags that describe what a piece of text means. A PDF contains a specific graphical representation of the content, with each piece placed at a position on the page. Without OCR it is impossible even to detect bold text. PDF isn't a good format to parse.

My problem was simpler than yours, and it was still hell to read from the PDF. It was plain text, but laid out as two columns per page. iTextSharp reads content by coordinates, so my text got mixed up: it read the first line of the first column, then the first line of the second column, instead of following the flow of the text. As for LaTeX, once LaTeX code has been converted to PDF there is no way back to the LaTeX source.
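
The two-column mix-up can be sketched without any PDF library. The `TextChunk` type, the `ReadTwoColumns` helper and the split position are all hypothetical, invented for illustration and not part of iTextSharp. The idea is only that positioned chunks are assigned to a column first and sorted top to bottom second, so the left column is read in full before the right one. If I recall correctly, iTextSharp 5 also provides `RegionTextRenderFilter` together with `FilteredTextRenderListener` for restricting extraction to a rectangle, which may be the more direct route when the column boundaries are known.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class TextChunk
{
    public float X { get; }
    public float Y { get; }
    public string Text { get; }

    public TextChunk(float x, float y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

public static class ColumnOrder
{
    // Reads the left column top to bottom, then the right column.
    // PDF coordinates grow upward, so a larger Y is higher on the page.
    public static string ReadTwoColumns(IEnumerable<TextChunk> chunks, float columnSplitX)
    {
        var ordered = chunks
            .OrderBy(c => c.X < columnSplitX ? 0 : 1) // column first
            .ThenByDescending(c => c.Y)               // then top to bottom
            .ThenBy(c => c.X);                        // then left to right within a line
        return string.Join("\n", ordered.Select(c => c.Text));
    }

    public static void Main()
    {
        // Chunks in the interleaved order a purely line-based reading would produce.
        var chunks = new List<TextChunk>
        {
            new TextChunk(50f, 700f, "left line 1"),
            new TextChunk(320f, 700f, "right line 1"),
            new TextChunk(50f, 680f, "left line 2"),
            new TextChunk(320f, 680f, "right line 2"),
        };

        Console.WriteLine(ReadTwoColumns(chunks, 300f));
    }
}
```

With the sample data this prints both left lines followed by both right lines. It only works when the column boundary is known or can be guessed, and it does nothing for equations, whose glyphs are positioned individually.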

Djuro