PDF extract coordinates and create Nested XML files

Question

I am trying to extract all the words (chunks) or characters, together with their coordinates, from a searchable-text PDF invoice or statement, using iTextSharp in a C# program. After getting the coordinates, I want to create an XML file, then read that XML file and plot the data in a DataGridView. So far I have used iTextSharp to extract each character and its rectangle (getRectangle). Could anyone suggest a method to create an XML file in the following format?

<PDFExtract>
<PageLayout>Style</PageLayout>
<Page>
    <Zone>
        <Line>
            <LOCX>298</LOCX>
            <LOCY>199</LOCY>
            <LOCW>1859</LOCW>
            <LOCH>138</LOCH>
            <WD>
                <LOCX>298</LOCX>
                <LOCY>199</LOCY>
                <LOCW>139</LOCW>
                <LOCH>69</LOCH>
                <T>Start</T>
            </WD>
            <WD>
                <LOCX>476</LOCX>
                <LOCY>216</LOCY>
                <LOCW>63</LOCW>
                <LOCH>55</LOCH>
                <T>Bucks</T>
            </WD>
        </Line>
    </Zone>
</Page>
</PDFExtract>

When you ask a question here, you should include some snippets of your code so that others can understand the problem and suggest a solution. — Yogesh Patel, Feb 26 '19 at 06:26
Sorry, I didn't make myself clear. Currently I use a desktop program to extract text from scanned documents. After the OCR process, it creates an XML file in the format shown above, and the desktop program parses that XML file into a DataGridView. Now I need to handle searchable PDF documents. I use iTextSharp to get the coordinates of every character in the PDF (p.Rect.Left / Right / Bottom / Width / Height), but I don't know how to use those coordinates to create an XML file in the format shown above, so that the existing desktop program can parse it into the DataGridView. — KEYBOARDFIGHTer, Feb 26 '19 at 10:34
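
For the "coordinates to XML" step described in the comment above, one possible approach is to build the nested structure with System.Xml.Linq once the words and their boxes are in memory. The following is only a minimal sketch under several assumptions: the `Word` class, the `PdfExtractXml` class, and the `yTolerance` line-grouping heuristic are hypothetical names introduced for illustration, not part of iTextSharp or of the existing desktop program. It also assumes the coordinates have already been converted to the integer, top-left-origin units the OCR output uses (PDF user space normally has its origin at the bottom-left, so a flip and scale are probably needed first), and that all words go into a single Zone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for one extracted word and its bounding box.
// Fill it from whatever rectangles your extraction code produces.
public sealed class Word
{
    public string Text;
    public int X, Y, W, H;
    public Word(string text, int x, int y, int w, int h)
    { Text = text; X = x; Y = y; W = w; H = h; }
}

public static class PdfExtractXml
{
    // The four location elements, in the order used by the target format.
    static IEnumerable<XElement> Loc(int x, int y, int w, int h)
    {
        return new[]
        {
            new XElement("LOCX", x), new XElement("LOCY", y),
            new XElement("LOCW", w), new XElement("LOCH", h)
        };
    }

    // Assumed heuristic: a word joins the current line when its top edge is
    // within yTolerance of the top edge of that line's first word.
    public static List<List<Word>> GroupIntoLines(IEnumerable<Word> words, int yTolerance)
    {
        var lines = new List<List<Word>>();
        foreach (var word in words.OrderBy(t => t.Y).ThenBy(t => t.X))
        {
            var last = lines.LastOrDefault();
            if (last != null && Math.Abs(word.Y - last[0].Y) <= yTolerance)
                last.Add(word);
            else
                lines.Add(new List<Word> { word });
        }
        return lines;
    }

    // Builds PDFExtract > Page > Zone > Line > WD. Each Line carries the
    // bounding box that encloses all of its words.
    public static XElement Build(IEnumerable<Word> words, int yTolerance = 20)
    {
        var zone = new XElement("Zone");
        foreach (var line in GroupIntoLines(words, yTolerance))
        {
            int x = line.Min(t => t.X), y = line.Min(t => t.Y);
            int right = line.Max(t => t.X + t.W), bottom = line.Max(t => t.Y + t.H);
            zone.Add(new XElement("Line",
                Loc(x, y, right - x, bottom - y),
                line.OrderBy(t => t.X).Select(t =>
                    new XElement("WD", Loc(t.X, t.Y, t.W, t.H), new XElement("T", t.Text)))));
        }
        return new XElement("PDFExtract",
            new XElement("PageLayout", "Style"),
            new XElement("Page", zone));
    }

    public static void Main()
    {
        // Sample values taken from the XML in the question.
        var words = new[]
        {
            new Word("Start", 298, 199, 139, 69),
            new Word("Bucks", 476, 216, 63, 55)
        };
        var doc = new XDocument(Build(words));
        Console.WriteLine(doc);
        // doc.Save("extract.xml"); // hypothetical output path
    }
}
```

If the extraction yields individual characters rather than words, they would first need to be merged into words (for example by joining characters whose horizontal gap is small) before calling `Build`. The tolerance of 20 is a guess and would have to be tuned to the page resolution.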


0 Answers