We have an images folder with about a million images in it. We need to write a program that fetches an image based on a keyword entered by the user, matching the keyword against file names. Looking for any suggestions. Thanks, N
9 Answers
Keep the images on a separate site or subdomain. You probably don't want all 1M files in a single directory, of course.
You need a database with (at least) three tables:
- ImageFile: ID, Filepath
- Keyword: ID, theWord
- ImageKeyword: ImageID, KeywordID
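The three-table schema above can be sketched out concretely; this uses SQLite for brevity (any relational database works the same way), with sample paths and keywords invented for illustration:

```python
import sqlite3

# In-memory database for illustration; use a file path in practice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ImageFile    (ID INTEGER PRIMARY KEY, Filepath TEXT NOT NULL);
    CREATE TABLE Keyword      (ID INTEGER PRIMARY KEY, theWord TEXT UNIQUE NOT NULL);
    CREATE TABLE ImageKeyword (ImageID   INTEGER REFERENCES ImageFile(ID),
                               KeywordID INTEGER REFERENCES Keyword(ID));
""")

def find_images(word):
    """Return the file paths of all images tagged with `word`."""
    rows = conn.execute("""
        SELECT f.Filepath
        FROM ImageFile f
        JOIN ImageKeyword ik ON ik.ImageID = f.ID
        JOIN Keyword      k  ON k.ID       = ik.KeywordID
        WHERE k.theWord = ?
    """, (word,))
    return [r[0] for r in rows]

# Tag one image with two keywords.
conn.execute("INSERT INTO ImageFile VALUES (1, 'images/0001.jpg')")
conn.execute("INSERT INTO Keyword VALUES (1, 'sunset'), (2, 'beach')")
conn.execute("INSERT INTO ImageKeyword VALUES (1, 1), (1, 2)")
```

The junction table is what lets one image carry many keywords and one keyword point at many images; an exact-match lookup on `theWord` can then use an index.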

- As well as this, you could hash each image so that you can check whether the image already exists. Don't use MD5, as it can produce the same result for different files; try SHA-1 or higher. – Dominic Zukiewicz Jun 16 '10 at 12:56
- @Dominic: sure. What kind of app are you thinking of that would benefit from that? – egrunin Jun 16 '10 at 15:32
- @Dominic Zukiewicz: "Don't use MD5 and instead use SHA-1"?! Fine, MD5 is 128 bits and SHA-1 is 160, but feeding SHA-1 with anything larger than 20 bytes will eventually result in a collision. Saying that SHA-1 will never produce collisions is just silly talk. – Patrick Jun 16 '10 at 15:42
- If you want to check whether the exact file already exists in the DB, a hash would help. But I was saying that certain algorithms have been known to produce the same key for completely different files. @Patrick - I appreciate that these algorithms have been broken, especially with images having such a diversity of data. Can we agree on SHA-256? Just trying to balance speed with data compactness. – Dominic Zukiewicz Jun 17 '10 at 09:57
Store everything (images & keywords) in a database.
You can use a full-text index to search for the words, or store each word as a separate entry.
And you will have much faster access to the metadata (filename, creation date, etc.) without retrieving (or opening) the image itself.
This is probably much faster than relying on a file system, which is not made to store a million entries in a single folder.
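The full-text-index route can be sketched with SQLite's FTS5 extension (assumed to be compiled into your SQLite build, as it is in most stock Python distributions); the file names and keywords are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: both columns are full-text indexed; the keywords
# column holds the words associated with each image.
conn.execute("CREATE VIRTUAL TABLE images USING fts5(filename, keywords)")
conn.execute("INSERT INTO images VALUES ('0001.jpg', 'red sunset over beach')")
conn.execute("INSERT INTO images VALUES ('0002.jpg', 'mountain lake')")

def search(word):
    """Full-text search across the indexed columns."""
    rows = conn.execute(
        "SELECT filename FROM images WHERE images MATCH ?", (word,))
    return [r[0] for r in rows]
```

Unlike a `LIKE '%word%'` scan, a MATCH query hits the inverted index, so it stays fast at a million rows.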

This is the obvious approach, but I'd imagine it would be pretty slow for a million images:
public IList<string> GetMatchingImages(string path, string keyword)
{
    var matches = new List<string>();

    // EnumerateFiles streams results instead of buffering a million paths.
    foreach (var image in System.IO.Directory.EnumerateFiles(path))
    {
        // Match on the file name only, not the full path.
        var name = System.IO.Path.GetFileName(image);
        if (name.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            matches.Add(image);
        }
    }
    return matches;
}

Depending on the operating system, I suggest you use Indexing Service, Windows Desktop Search, or the latest version of Windows Search. This solves your problem of file lookup based on keyword, it addresses the performance issues in regards to the number of files within a folder, it is scalable, and easily extended.
The DSearch sample at http://msdn.microsoft.com/en-us/library/dd940335(VS.85).aspx does almost exactly what you want and is easy to implement.
For example, if you are querying a million files and need to move files into subfolders to improve performance, you can simply create the folders and move the files. You will not need to change any code.
If you need to change how keywords are applied, such as using the keywords of the file's summary properties, then you only need to change the query.
On later operating systems, you do not even need to install any software, because the search feature is part of the operating system and available through OleDB. If you want to use Advanced Query Syntax (AQS), Microsoft provides a type library for the COM interfaces that makes it easy to generate the SQL command to query the index database.
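Over OleDB, a query against the Windows Search index is expressed as SQL over the SYSTEMINDEX catalog. A rough sketch of the kind of query involved (the scope path and keyword here are placeholders, not taken from the question):

```sql
SELECT System.ItemPathDisplay, System.FileName
FROM SYSTEMINDEX
WHERE SCOPE = 'file:C:/images'
  AND CONTAINS(System.FileName, '"sunset*"')
```

The SCOPE clause restricts the search to one folder tree, and CONTAINS uses the index rather than scanning file names.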
Honestly, all these other suggestions about databases, and so on, are a waste of time.
MSDN search of windows search at http://social.msdn.microsoft.com/Search/en-US?query=windows+search
Related Search Technologies to Windows Search at http://msdn.microsoft.com/en-us/library/bb286798(VS.85).aspx
Searching a million files in one folder is going to be prohibitively slow. (See my response to "Directory file size calculation - how to make it faster?".)
I can search my hard drive of ~300,000 files for "*tabcontrol.cs" in less than a second. The first query takes approx. 4000 ms, and each subsequent query, using a different search term, takes 300-600 ms.
- I just updated from "Indexing Service" to "Windows Search" and I can search 300,000 files over 58GB for "filename: tabcontrol" in 1.25 seconds with subsequent searches taking 0.13 to 0.26 seconds.
See the DSearch sample at http://msdn.microsoft.com/en-us/library/dd940335(VS.85).aspx for how easy this is to implement.
"Searching the Desktop" at http://blogs.msdn.com/b/coding4fun/archive/2007/01/05/1417884.aspx
Searching for a file across a hard drive is a slow, tedious operation. Learn how to take advantage of the Windows Desktop Search API and database to find files very quickly. Add innovative new features to your applications using the search capabilities built-in to Vista and available for Windows XP.
- These methods work if the keywords will be embedded in the file metadata. The people suggesting databases are assuming otherwise, and that he wants centralized editing of keywords. – egrunin Jun 16 '10 at 15:43
- @egrunin: You can store the keywords in the file's Summary information provided by the operating system, which is stored as an Alternate Data Stream. Keywords can be managed through Windows Explorer. Everything is already provided. – AMissico Jun 16 '10 at 16:13
Getting a million file names from a folder will take a lot of time. I would suggest that you get the file names and put them in a database. That way you can search the names within seconds instead of minutes.
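That two-step approach (one slow scan, then fast name lookups) can be sketched as below; the folder, file names, and table name are invented for illustration:

```python
import os
import sqlite3
import tempfile

def build_index(folder, conn):
    """One-time scan: store every file name in the database."""
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT)")
        # os.scandir streams directory entries instead of building
        # one huge in-memory list for the whole folder.
        conn.executemany(
            "INSERT INTO files VALUES (?)",
            ((e.name,) for e in os.scandir(folder) if e.is_file()))

def search(conn, keyword):
    """Substring search over the stored names."""
    rows = conn.execute("SELECT name FROM files WHERE name LIKE ?",
                        (f"%{keyword}%",))
    return sorted(r[0] for r in rows)

# Demo against a throwaway folder with two files.
folder = tempfile.mkdtemp()
for name in ("sunset_beach.jpg", "mountain.jpg"):
    open(os.path.join(folder, name), "w").close()

conn = sqlite3.connect(":memory:")
build_index(folder, conn)
```

The scan cost is paid once (or on a schedule), after which every user query is a table lookup rather than a directory walk.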

There are the Win32 APIs FindFirstFile, FindNextFile, and FindClose: http://msdn.microsoft.com/en-us/library/aa364418(VS.85).aspx - they probably map into .NET somehow as well. Use them to search for the image without any database.
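FindFirstFile/FindNextFile take a wildcard pattern, so the search can be pushed down to the file system rather than filtering in your own loop (in .NET, `Directory.EnumerateFiles(path, "*keyword*")` does the same). A Python sketch of the equivalent wildcard search, with invented file names:

```python
import glob
import os
import tempfile

def find_images(folder, keyword):
    """Wildcard search, analogous to FindFirstFile with a '*keyword*' pattern."""
    pattern = os.path.join(folder, f"*{keyword}*")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

# Demo folder with two files.
folder = tempfile.mkdtemp()
for name in ("red_sunset.jpg", "lake.png"):
    open(os.path.join(folder, name), "w").close()
```

Note this still enumerates the directory on every query, so it avoids a database at the cost of per-search scan time.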

My first thought for such a large number of images would be to create an inverted list to use as an index.
If you are able to maintain this list, it would make searching relatively quick, and you wouldn't have to trawl through a million images, which I'm guessing would be too time-consuming for you.
I'd start with looking for some inverted-list implementations.
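The core of an inverted list is a map from each keyword to the files that carry it; a minimal in-memory sketch (the keyword source - file names, tags, a manifest - is an assumption here):

```python
from collections import defaultdict

def build_inverted_index(files):
    """Map each keyword to the set of file names tagged with it.

    `files` is {filename: [keywords]}, from wherever the keywords live.
    """
    index = defaultdict(set)
    for filename, keywords in files.items():
        for word in keywords:
            index[word.lower()].add(filename)
    return index

index = build_inverted_index({
    "0001.jpg": ["red", "sunset", "beach"],
    "0002.jpg": ["beach", "ball"],
})
```

A lookup is then a single dictionary access instead of a scan over a million entries; maintaining the index means updating these sets whenever images are added or retagged.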

One simple solution is a database in which you store an ID, the path, and a varchar (string) field holding all the keywords. (The keywords could be stored in a separate table for efficiency.)
That way you could search by file name or by the keywords associated with an image.
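A minimal sketch of that single-table layout, again using SQLite with invented paths and keywords:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE images (
    id       INTEGER PRIMARY KEY,
    path     TEXT NOT NULL,
    keywords TEXT NOT NULL   -- space-delimited, e.g. 'red sunset beach'
)""")
conn.execute(
    "INSERT INTO images VALUES (1, 'images/0001.jpg', 'red sunset beach')")
conn.execute(
    "INSERT INTO images VALUES (2, 'images/0002.jpg', 'mountain lake')")

def search(word):
    """Substring match against the keyword field."""
    rows = conn.execute("SELECT path FROM images WHERE keywords LIKE ?",
                        (f"%{word}%",))
    return [r[0] for r in rows]
```

The caveat is that `LIKE '%word%'` cannot use an ordinary index, which is why moving keywords into their own table (or a full-text index) pays off at a million rows.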

Just rename all the images to their respective keywords delimited by spaces. Then use the OS's own search feature.
If that doesn't work, only then look for fancier solutions.
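The renaming step can be sketched like this (the keyword list and file names are invented; collisions between images with identical keywords are not handled):

```python
import os
import tempfile

def rename_to_keywords(path, keywords):
    """Rename a file to its space-delimited keywords, keeping the extension."""
    folder = os.path.dirname(path)
    ext = os.path.splitext(path)[1]
    new_path = os.path.join(folder, " ".join(keywords) + ext)
    os.rename(path, new_path)
    return new_path

# Demo: rename a camera-style file name to its keywords.
folder = tempfile.mkdtemp()
old = os.path.join(folder, "IMG_0001.jpg")
open(old, "w").close()
new = rename_to_keywords(old, ["red", "sunset", "beach"])
```

After that, the OS search (or a plain wildcard match on file names) is the whole lookup mechanism.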
