
I want to read the coordinates of a particular line on a particular page of a PDF using Python, but I have not been able to find a suitable library. For now I am using the C# code below. Can anyone help me find a wrapper that would let me run this code from Python?

Code:

using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile("sample1.pdf");

            int pageCount = extractor.GetPageCount();
            RectangleF location;

            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string
                if (extractor.Find(i, "ipsum", false, out location))
                {
                    do
                    {
                        Console.WriteLine("Found on page " + i + " at location " + location.ToString());

                    }
                    while (extractor.FindNext(out location));
                }
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadLine();
        }
    }
}
Prabal

1 Answer


I see three options for running this code from a Python program (assuming you are on Windows):

Preferable: If you can use the IronPython interpreter (see ironpython.net), you can use the PDFExtractor library directly from Python code:

import clr
clr.AddReferenceToFileAndPath('c:\\path\\to\\pdfextractor.dll')
from Bytescout.PDFExtractor import TextExtractor
extractor = TextExtractor()
extractor.RegistrationName = 'demo'
# etc

Alternatively: Compile your C# program with the C# compiler csc.exe before you run it (save the program as Extract.cs and make sure it accepts the path to the PDF file as an input parameter):

import os,tempfile,shutil
csc = 'c:\\WINDOWS\\Microsoft.Net\\Framework64\\v4.0.30319\\csc.exe' # Or somewhere else, see below
filename = 'c:\\path\\to\\pdffile.pdf'
tempdir = tempfile.mkdtemp(prefix='Extract-temp-')
os.system(csc + ' /t:exe /out:' + tempdir + '\\Extract.exe c:\\path\\to\\Extract.cs /r:c:\\path\\to\\PDFExtractor.dll')
with os.popen(tempdir + '\\Extract.exe '+filename) as F:
    extractResult = F.read()
shutil.rmtree(tempdir)
print(extractResult)

Up to .NET Framework version 4.5 / C# 5, csc.exe was included in the framework install. To get a version of csc.exe that supports C# 6.0, consult e.g. stackoverflow.com/questions/39089426.
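
A variant of the compile-and-run snippet above, sketched with subprocess and argument lists instead of concatenated command strings, so paths containing spaces need no manual quoting. The helper names csc_command and compile_and_run are made up for this illustration; the csc.exe switches (/t:exe, /out:, /r:) are the ones used above. It assumes Python 3.7+ and, like the original, that Extract.exe prints its results to standard output.

```python
import os
import subprocess
import tempfile


def csc_command(csc, source, out_exe, references=()):
    """Build the csc.exe argument list (hypothetical helper, not part of any library)."""
    cmd = [csc, "/t:exe", "/out:" + out_exe]
    cmd += ["/r:" + ref for ref in references]
    cmd.append(source)
    return cmd


def compile_and_run(csc, source, references, pdf_path):
    """Compile Extract.cs into a temporary directory, run it on pdf_path, return its output."""
    with tempfile.TemporaryDirectory(prefix="Extract-temp-") as tempdir:
        exe = os.path.join(tempdir, "Extract.exe")
        # check=True raises CalledProcessError if compilation or the run fails
        subprocess.run(csc_command(csc, source, exe, references), check=True)
        done = subprocess.run([exe, pdf_path], capture_output=True, text=True, check=True)
        return done.stdout
```

The temporary directory is removed automatically when the with block exits, which replaces the explicit shutil.rmtree call.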

Finally, you can use ctypes and the "Unmanaged Exports (DllExport for .Net)" NuGet package to call a C# assembly directly from CPython, as outlined in stackoverflow.com/questions/7367976.

EDIT based on denfromufa's comment: The best way to script PDFExtractor from Python is to use pythonnet in CPython (on Windows you can install it with python -m pip install pythonnet). With this approach, your C# program above can be replaced with this script (tested with Python 2.7, win32):

import clr
# 'import System'  will work here (must be after 'import clr')
# You can also import System.Drawing and other .NET namespaces
clr.AddReference(r'c:\path\to\Bytescout.PDFExtractor.dll')
from Bytescout.PDFExtractor import TextExtractor
extractor = TextExtractor()
extractor.RegistrationName = 'demo'
extractor.RegistrationKey = 'demo'
extractor.LoadDocumentFromFile(r'c:\path\to\mydoc.pdf')
pageCount = extractor.GetPageCount()
for i in range(pageCount):
    result = extractor.Find(i, "somestring", False)
    while result:
        print('Found on page ' + str(i) + ' on location ' + str(extractor.FoundText.Bounds))
        result = extractor.FindNext()
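
If you need the matches as data rather than printed lines, the loop above could be wrapped in a generator. This is only a sketch: iter_matches is a name made up here, and it relies solely on the calls already used in the script above (GetPageCount, Find, FindNext, FoundText.Bounds), so it inherits whatever behaviour those have in your PDFExtractor version.

```python
def iter_matches(extractor, text):
    """Yield (page_index, bounds) for every occurrence of text in the loaded document.

    extractor is assumed to be a TextExtractor that already has a document
    loaded, as in the script above.
    """
    for page in range(extractor.GetPageCount()):
        # Third argument copied unchanged from the script above
        found = extractor.Find(page, text, False)
        while found:
            yield page, extractor.FoundText.Bounds
            found = extractor.FindNext()
```

Usage would then look like: for page, bounds in iter_matches(extractor, "somestring"): print(page, bounds)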
sveinbr
  • Hey @sveinbr ! Thank you for the help. The last method can surely work out. However, can you please help me how I can use the initial 3 libraries (Bytescout.PDFExtractor, System.Drawing, System) of C# in Python? – Prabal May 19 '17 at 11:52
  • @Prabal: Hi, as denfromufa pointed out in the comment, pythonnet can do the same thing as the last method (marked Finally) in a better way. I will update the answer with a recipe. – sveinbr May 20 '17 at 12:05
  • @sveinbr: The method stated by you and denfromufa may work for sure. However, notice the line: clr.AddReference(r'c:\path\to\Bytescout.PDFExtractor.dll') . The problem is Bytescout library is a paid one. Can you suggest any alternative to this library? – Prabal May 22 '17 at 12:24
  • @Prabal: I have no experience with PDFExtractor or similar libraries myself, but after doing a quick google search, I think the open source and pure python [pdfminer](https://euske.github.io/pdfminer/) could be what you need. (btw: please accept my answer if you find that it solved the question in the original post) – sveinbr May 22 '17 at 14:13
  • @Prabal accepting correct answers is considered polite on StackOverflow. You can use gray check box under voting buttons to mark answer as accepted. – Vasily Ryabov Mar 08 '19 at 19:39
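
For the open-source route mentioned in the comments, here is a rough sketch of reading line coordinates with pdfminer. It assumes the maintained fork pdfminer.six (python -m pip install pdfminer.six), whose high-level API differs from the original pdfminer linked above. The names lines_containing and find_in_pdf are illustrative, not part of the library.

```python
def lines_containing(page_layout, needle):
    """Return (text, bbox) for each text line on one page that contains needle.

    page_layout is one page as yielded by pdfminer.six's extract_pages().
    Text boxes are recognised by duck typing: they have get_text() and can
    be iterated to get their lines, each with get_text() and a bbox.
    """
    hits = []
    for element in page_layout:
        if not hasattr(element, "get_text") or not hasattr(element, "__iter__"):
            continue  # skip figures, rectangles, curves and other non-text items
        for line in element:
            if hasattr(line, "bbox") and needle in line.get_text():
                hits.append((line.get_text().strip(), line.bbox))
    return hits


def find_in_pdf(pdf_path, needle, page_index):
    """Search a single page (zero-based page_index) of the PDF at pdf_path."""
    # Imported lazily so the helper above can be used without pdfminer.six installed
    from pdfminer.high_level import extract_pages

    hits = []
    for page_layout in extract_pages(pdf_path, page_numbers=[page_index]):
        hits.extend(lines_containing(page_layout, needle))
    return hits
```

Each bbox is a tuple (x0, y0, x1, y1) in PDF points, measured from the bottom-left corner of the page.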