
I'm trying to implement this feature in my application.

File Content Search

Just like in Windows: I type into the search box and, if "File contents" is checked in the settings, the search returns every file that contains the search string, whether it is a text file or a PDF/Word file.

I have already built an application for file and folder search, and its content search works well for text files and Word files. I'm using Word interop for the Word files.

I know I can use iTextSharp or another third-party library to do this for PDF files, but that doesn't satisfy me. I want to find out how Windows does it, or whether anyone has done it a different way. I would prefer not to use a third-party tool, though that doesn't mean I can't; I just want to keep my application light and not load it up with many dependencies.

StackUseR
  • Basically, your PDF viewer installs an IFilter so Windows can use it to search PDF contents: https://superuser.com/questions/402673/how-to-search-inside-pdfs-with-windows-search – Magnetron Mar 09 '21 at 14:24
  • [This question](https://stackoverflow.com/questions/7313828/using-ifilter-in-c-sharp-and-retrieving-file-from-database-rather-than-file-syst) might help you. – Magnetron Mar 09 '21 at 14:39

2 Answers


As far as I know, it is not possible to search PDF content without some third-party tool, software or utility installed. pdfgrep is one example of such a tool. Since you are writing a C# program anyway, I would include a third-party library to do the job.

I made a solution for something similar in this answer, "Read specific value based on label name from PDF in C#"; with a bit of tweaking you can get what you are looking for. The only catch is that PdfClown targets .NET Framework; on the other hand, it is open source, free and has no limitations. If you need .NET Core, you may find some free (with limitations) or paid PDF libraries.

As you requested in the comments, here is a sample solution that finds text inside PDF pages. I have left comments inside the code:

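//Members of the SearchPdfContent class used in Main further down.
//Assumed usings (PdfClown namespaces; the File alias avoids a clash with System.IO.File):
//  using System.Collections.Generic;
//  using System.IO;
//  using org.pdfclown.documents.contents;
//  using org.pdfclown.documents.contents.objects;
//  using File = org.pdfclown.files.File;

//Text fragments collected from the PDF currently being searched
private List<string> _contentList;

//Returns true if the given PDF file contains the word (case-sensitive match)
public bool SearchPdf(FileInfo fileInfo, string word)
{
    _contentList = new List<string>();
    ExtractPages(fileInfo.FullName);
    var content = string.Join(" ", _contentList);
    return content.Contains(word);
}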

//Extract content for each page of given pdf file
private void ExtractPages(string filePath)
{
    using (var file = new File(filePath))
    {
        var document = file.Document;

        foreach (var page in document.Pages)
        {
            Extract(new ContentScanner(page));
        }
    }
}

//Extract content of pdf page and put the found result inside _contentList
private void Extract(ContentScanner level)
{
    if (level == null)
        return;

    while (level.MoveNext())
    {
        var content = level.Current;
        switch (content)
        {
            case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
            case Text _:
            case ContainerObject _:
                Extract(level.ChildLevel);
                break;
        }
    }
}

Now let's do a quick test, assuming all your invoices are in the c:\temp folder:

static void Main(string[] args)
{
    var program = new SearchPdfContent();

    DirectoryInfo d = new DirectoryInfo(@"c:\temp");
    FileInfo[] Files = d.GetFiles("*.pdf");
    var word = "Sushi";
    foreach (FileInfo file in Files)
    {
        var found = program.SearchPdf(file, word);
        if (found)
        {
            Console.WriteLine($"{file.FullName} contains word {word}");
        }
    }
}

In my case, for example, the word "Sushi" appears in one of the invoices:

c:\temp\invoice0001.pdf contains word Sushi

All that said, this is only an example solution. You can take it from here and bring it to the next level. Enjoy your day.
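One possible way to reduce the total search time is to check several files at once, since each PDF is searched independently. The following is only a minimal sketch under stated assumptions: `ParallelFileSearch`, `FindMatches` and the `containsWord` delegate are hypothetical names that are not part of the code above, and the delegate must be safe to call from several threads at the same time. `SearchPdf` as written keeps its state in the `_contentList` field, so it would need a separate `SearchPdfContent` instance per call, and whether PdfClown itself behaves well under concurrent use is not something I have verified.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelFileSearch
{
    // Returns, sorted by path, every path for which containsWord(path, word) is true.
    // containsWord is a caller-supplied (hypothetical) per-file check and
    // must be thread-safe, because it is invoked from multiple threads.
    public static List<string> FindMatches(
        IEnumerable<string> paths,
        string word,
        Func<string, string, bool> containsWord)
    {
        // ConcurrentBag allows adding results from several threads safely.
        var matches = new ConcurrentBag<string>();

        Parallel.ForEach(paths, path =>
        {
            if (containsWord(path, word))
            {
                matches.Add(path);
            }
        });

        // Parallel execution completes in no particular order, so sort the result.
        return matches.OrderBy(p => p, StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

With the class above, the delegate could be something like `(path, w) => new SearchPdfContent().SearchPdf(new FileInfo(path), w)`, creating a fresh instance per file so that no state is shared between threads. How much this helps depends on the number of cores and on how much of the time is spent on disk access.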

I leave some links of what I have searched for:

Maytham Fahmi
  • If you like, I can post the tweaked PdfClown code that I made in that answer. – Maytham Fahmi Mar 14 '21 at 12:17
  • Surely that would help a lot. Actually, as suggested in your previous answer, I used PdfClown, but my code takes 10 minutes to search for a specific text, say `invoice`, in 140 PDF files. I would really like to try your code. Thanks for replying. – StackUseR Mar 14 '21 at 16:04
  • Sure. Not a problem. Thanks anyways :) – StackUseR Mar 15 '21 at 03:14
  • Yes, this works fine, though it takes 25 minutes to finish the task. But as you said, I will try to modify it accordingly. Thanks a ton, buddy. – StackUseR Mar 15 '21 at 11:50
  • You are indeed welcome. I know the code needs some improvement and its performance can be improved as well, but that requires a bit of extra work. I hope you can bring it to the next level. Enjoy your day. – Maytham Fahmi Mar 15 '21 at 11:52

If your application is meant to search the contents of files stored as binary data in your database, the SQL Server Full-Text Search feature can do this for you.

You just need to make sure that you have the required IFilters installed and create a full-text index on the table where the binary data is stored.

But if your application must access a folder in real time and search file contents there, you will probably need a third-party tool, just as @maytham-ɯɐɥʇʎɐɯ said.
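For the database case, the query side could look roughly like the sketch below. It assumes a hypothetical table `dbo.Documents` with a `FileName` column and a full-text indexed `varbinary(max)` column named `Content` (with a type column holding the file extension so the matching IFilter is used). Those names, and the `FullTextDocumentSearch` class itself, are illustrative assumptions and not something from your schema.

```csharp
using System.Collections.Generic;
using System.Data;

public static class FullTextDocumentSearch
{
    // Hypothetical table and column names; adjust to the real schema.
    public const string Query =
        "SELECT FileName FROM dbo.Documents WHERE CONTAINS(Content, @term)";

    // Wraps the term in double quotes so CONTAINS treats it as a phrase.
    // Embedded double quotes are simply dropped (a simplification).
    public static string ToPhrase(string term)
    {
        return "\"" + term.Replace("\"", "") + "\"";
    }

    // Expects an already opened connection to the database.
    public static List<string> FindFileNames(IDbConnection connection, string term)
    {
        var names = new List<string>();
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = Query;

            // Pass the search phrase as a parameter rather than concatenating SQL.
            IDbDataParameter parameter = command.CreateParameter();
            parameter.ParameterName = "@term";
            parameter.Value = ToPhrase(term);
            command.Parameters.Add(parameter);

            using (IDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    names.Add(reader.GetString(0));
                }
            }
        }
        return names;
    }
}
```

The search then runs inside the database against the full-text index, so the application does not have to open and parse each file itself.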

Ishikawa