0

So i have a main directory with sub folders and around 500k images. I know alot of theese images does not exist in my database and i want to know which ones so that i can delete them.

This is the code i have so far:

var listOfAdPictureNames = ImageDB.GetAllAdPictureNames();

var listWithFilesFromImageFolder = ImageDirSearch(adPicturesPath);

var result = listWithFilesFromImageFolder.Where(p => !listOfAdPictureNames.Any(q => p.FileName == q));

var differenceList = result.ToList();

listOfAdPictureNames is of type List<string>

here is my model that im returing from the ImageDirSearch:

public class CheckNotUsedAdImagesModel
{
    public List<ImageDirModel> ListWithUnusedAdImages { get; set; }
}

public class ImageDirModel
{
    public string FileName { get; set; }
    public string Path { get; set; }
}

and here is the recursive method to get all images from my folder.

private List<ImageDirModel> ImageDirSearch(string path)
        {
            string adPicturesPath = ConfigurationManager.AppSettings["AdPicturesPath"];
            List<ImageDirModel> files = new List<ImageDirModel>();

try
{
    foreach (string f in Directory.GetFiles(path))
    {
        var model = new ImageDirModel();
        model.Path = f.ToLower();
        model.FileName = Path.GetFileName(f.ToLower());
        files.Add(model);
    }
    foreach (string d in Directory.GetDirectories(path))
    {
        files.AddRange(ImageDirSearch(d));
    }
}
catch (System.Exception excpt)
{
    throw new Exception(excpt.Message);
}

return files;

}

The problem I have is that this row:

var result = listWithFilesFromImageFolder.Where(p => !listOfAdPictureNames.Any(q => p.FileName == q));

takes over an hour to complete. I want to know if there is a better way to check in my images folder if there are images there that doesn't exist in my database.

Here is the method that get all the image names from my database layer:

    public static List<string> GetAllAdPictureNames()
    {
        List<string> ListWithAllAdFileNames = new List<string>();

        using (var db = new DatabaseLayer.DBEntities())
        {
            ListWithAllAdFileNames = db.ad_pictures.Select(b => b.filename.ToLower()).ToList();
        }



        if (ListWithAllAdFileNames.Count < 1)
            return new List<string>();

        return ListWithAllAdFileNames;
    }
  • Does `GetAllAdPictureNames` use `GetFiles` or `EnumerateFiles`? The same for your ImageDirSearch, it doesn't make much sense to create your own `FileInfo` class when you can just use [`DirectoryInfo.EnumerateFiles`](https://msdn.microsoft.com/en-us/library/dd413232%28v=vs.110%29.aspx), as to how much this helps. I dont know, but it definitely will – Sayse Apr 01 '15 at 07:13
  • I updated my post so you can see the method for your self. – steffokeffo Apr 01 '15 at 07:17
  • I'll try to come up with an answer with some suggestions.. – Sayse Apr 01 '15 at 07:17
  • 2
    A [HashSet](https://msdn.microsoft.com/library/bb359438.aspx) is optimized for [Contains](https://msdn.microsoft.com/library/bb356440.aspx). So you might want to read all the strings in `listOfAdPictureNames` into a `HashSet hashSetOfAdPictureNames` and then try `...Where(p => !hashSetOfAdPictureNames.Contains(p.FileName))` to see if that's (a lot) faster. – Corak Apr 01 '15 at 07:20
  • ... and [Except](https://msdn.microsoft.com/library/vstudio/bb300779.aspx) seems to use a set behind curtains, so go for that and credits to @spersson – Corak Apr 01 '15 at 07:27
  • @Corak - Its probably worth making that an answer – Sayse Apr 01 '15 at 07:35

3 Answers3

1

Perhaps Except is what you're looking for. Something like this:

var filesInFolderNotInDb = listWithFilesFromImageFolder.Select(p => p.FileName).Except(listOfAdPictureNames).ToList();

Should give you the files that exist in the folder but not in the database.

spersson
  • 538
  • 1
  • 8
  • 19
  • That only works with 2 lists of strings, one of my lists contains objects of type ImageDirModel – steffokeffo Apr 01 '15 at 07:19
  • 1
    @steffokeffo - `listWithFilesFromImageFolder.Select(p => p.FileName).Except(...)`. – Corak Apr 01 '15 at 07:21
  • If i use this the return type is of type , i need it to be of type ImageDirModel because i need both the filename and the path in my list that im returning. – steffokeffo Apr 01 '15 at 07:48
  • @steffokeffo Isn't it possible to use the path from adPicturesPath? – spersson Apr 01 '15 at 07:53
  • @spersson no because there is 1000 subfolders and i need to know exactly where all the specifc images are located. Im using System.IO.File.Delete and method takes a complete path to the file being deleted including its filename – steffokeffo Apr 01 '15 at 07:57
0

As I said in my comment, you seem to have recreated the FileInfo class, you don't need to do this, so your ImageDirSearch can become the following

private IEnumerable<string> ImageDirSearch(string path)
{
    return Directory.EnumerateFiles(path, "*.jpg", SearchOption.TopDirectoryOnly);
}

There doesn't seem to be much gained by returning the whole file info where you only need the file name, and also this only finds jpgs, but this can be changed..

The ToLower calls are quite expensive and a bit pointless, so is the to list when you are planning on querying again so you can get rid of that and return an IEnumerable again, (this is in the GetAllAdPictureNames method)

Then your comparison can use equals and ignore case.

!listOfAdPictureNames.Any(q => p.Equals(q, StringComparison.InvariantCultureIgnoreCase));

One more thing that will probably help is removing items from the list of file names as they are found, this should make the searching of the list quicker every time one is removed since there is less to iterate through.

Sayse
  • 42,633
  • 14
  • 77
  • 146
  • Note: As Corak has stated, a hash set will probably help too :) – Sayse Apr 01 '15 at 07:25
  • First time when i wrote this method i tried the Directory.EnumareFiles and its 3x slower then the recursive method i wrote. I don't know why that's the case. Also it searches for specific types, i just want to load all the files from the main folder and its sub folders. – steffokeffo Apr 01 '15 at 07:29
  • @steffokeffo - Hmm thats surprising, but fair enough, I definitely don't see a need for the `ToLowers` though, did you try with `DirectoryInfo.EnumerateFiles` too? – Sayse Apr 01 '15 at 07:32
  • yes i did, i removed the ToLower as well, and yes it was a little bit faster, but its really marginal. If i would estimate it made about 2-4% difference in time which really isnt big enough. – steffokeffo Apr 01 '15 at 07:51
  • @steffokeffo - Yeah I didn't expect it to be much but at least it would help. I've also updated my answer since then suggesting that you remove entries from the list of file names as you go as that will give you less to search through every time – Sayse Apr 01 '15 at 07:53
  • And i forgot to mention, the reason why im returning the path is well cause i have a delete button in my view, and that button takes the List result , and uses the path property to remove all the files. – steffokeffo Apr 01 '15 at 07:54
  • @steffokeffo - You can still use the `FileInfo` class to do that, I wouldn't have thought it would make any performance differences but will make sure your program's compatibility is optimal – Sayse Apr 01 '15 at 07:59
  • @steffokeffo - You may also be interested in [this answer by Jon Skeet](http://stackoverflow.com/a/1660232/1324033), it outlines how `ToLower` can return invalid results – Sayse Apr 01 '15 at 08:14
  • well i figured since i only have the filename from the db and i need the path to delete the file i would create an object that held both values. – steffokeffo Apr 01 '15 at 08:14
  • 1
    i read the post, really interesting and i didn't know that. Thank you! :) – steffokeffo Apr 01 '15 at 08:34
0

Instead of the search being repeated on each of these lists its optimal to sort second list "listOfAdPictureNames" (Use any of n*log(n) sorts). Then checking for existence by binary search will be the most efficient all other techniques including the current one are exponential in order.

Neeraj
  • 596
  • 7
  • 9