
Convert a string file path to a unique identifier

These are the kinds of file paths that I need to convert to a unique ID (an int would be preferred):

D:\Images\PSSL\2019\Team_Colours\Base_1\Generic.png
D:\Images\Generic.png
D:\Images\Generic\Images\2019\Base.png

The path will vary from image to image.

Apologies for not posting any code, as I am a bit lost on how to proceed.

  • 3
    The process of converting something to an unique ID is generally called *hashing*. That gives you a term to research on. Related: [How to get the MD5 hash of a file in C++?](https://stackoverflow.com/questions/1220046/how-to-get-the-md5-hash-of-a-file-in-c) – TrebledJ Jun 05 '19 at 05:21
  • 2
    I don't understand. Isn't the path *itself* already a unique identifier? Or are you trying to avoid symlinks? Or do you *actually* mean "integer" when you say "unique identifier". – Nicol Bolas Jun 05 '19 at 05:27
  • 3
    warning hashing can produce the same value for different strings – bruno Jun 05 '19 at 05:27
  • @NicolBolas I want to convert the string to an unsigned integer value –  Jun 05 '19 at 05:28
  • @bruno then I believe hashing won't be the best solution for that. –  Jun 05 '19 at 05:30
  • @shomit yes it is available under windows – bruno Jun 05 '19 at 05:39
  • 1
    @bruno technically true, but with an appropriate hash (e.g. MD5), the chances of a hash collision are small enough that they can typically be ignored. – Jeremy Friesner Jun 05 '19 at 05:53
  • @JeremyFriesner a program which has a _small enough_ chance of not working is a program which does not work ^^ – bruno Jun 05 '19 at 06:11
  • @shomit I edited my answer to add a proposal of class giving unique ID for strings – bruno Jun 05 '19 at 06:11
  • You in general cannot convert something to something else that is both unique and smaller than the original something. Perhaps this is an XY problem. Why do you need this? – n. m. could be an AI Jun 05 '19 at 06:30
  • @bruno and yet people rely on secure hashing all the time, and the world doesn't end :) – Jeremy Friesner Jun 05 '19 at 15:51
  • 1
    @JeremyFriesner the OP requests a _unique identifier_; by *definition* a _hash_ **cannot** offer that – bruno Jun 05 '19 at 15:56

1 Answer


Your strings are not arbitrary strings but paths. If the corresponding files/directories always exist, you can use their inode number (field d_ino in struct dirent).

Note: dirent is available on Linux/Unix/Windows. If you do not have it because of the compiler you use, look at [List of all files inside the folder and its subfolders in Windows](https://stackoverflow.com/questions/54363548/list-of-all-files-inside-the-folder-and-its-subfolders-in-windows).
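As a rough illustration of the inode idea, here is a minimal sketch that reads the number with POSIX stat() rather than by walking a directory with dirent. The helper name inode_id is hypothetical, and the sketch assumes a POSIX system. Keep in mind that an inode number is only unique within one filesystem, and that two hard links to the same file share it.

```cpp
#include <sys/stat.h>  // POSIX stat(); not part of standard C++
#include <cstdint>
#include <string>

// Hypothetical helper (not from the answer): stores the inode number of
// `path` in `id` and returns true, or returns false if stat() fails,
// e.g. because the path does not exist.
bool inode_id(const std::string& path, std::uint64_t& id) {
  struct stat st;
  if (stat(path.c_str(), &st) != 0)
    return false;
  id = static_cast<std::uint64_t>(st.st_ino);
  return true;
}
```

If paths can live on several drives or mount points, pairing st_ino with st_dev from the same struct would be needed to keep the identifier unique.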


If the file/directory may not exist, you can build a string -> int dictionary yourself, for example:

#include <iostream>
#include <string>
#include <map>
#include <list>

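// Maps each distinct string to a unique non-zero unsigned ID.
class UI {
  public:
    UI() : next(1) {}
    unsigned search(std::string) const;  // ID of the string, or 0 if unknown
    unsigned get(std::string);           // ID of the string, assigning one if needed
    unsigned forget(std::string);        // drop the string; returns its old ID, or 0

  private:
    std::map<std::string, unsigned> m;   // string -> ID
    std::list<unsigned> free;            // released IDs available for reuse
    unsigned next;                       // ID handed out when the free list is empty
};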

unsigned UI::search(std::string s) const {
  std::map<std::string, unsigned>::const_iterator it = m.find(s);

  return (it == m.end()) ? 0 : it->second;
}

unsigned UI::get(std::string s) {
  std::map<std::string, unsigned>::const_iterator it = m.find(s);

  if (it != m.end())
    return it->second;

  unsigned r;

  if (!free.empty()) {
    r = free.front();
    free.pop_front();
  }
  else
    r = next++;

  m[s] = r;
  return r;
}

unsigned UI::forget(std::string s) {
  std::map<std::string, unsigned>::const_iterator it = m.find(s);

  if (it == m.end())
    return 0;

  unsigned r = it->second;

  m.erase(it);

  if (r == (next - 1))
    next -= 1;
  else
    free.push_back(r);

  return r;
}

int main(void)
{
  UI ui;

  std::cout << "aze " << ui.search("aze") << std::endl;  
  std::cout << "aze " <<  ui.get("aze") << std::endl;
  std::cout << "qsd " <<  ui.get("qsd") << std::endl;
  ui.forget("aze");
  std::cout << "aze " << ui.search("aze") << std::endl;
  std::cout << "wxc " <<  ui.get("wxc") << std::endl;
  return 0;
}

Compilation and execution :

pi@raspberrypi:/tmp $ g++ -pedantic -Wall -Wextra c.cc
pi@raspberrypi:/tmp $ ./a.out
aze 0
aze 1
qsd 2
aze 0
wxc 1
pi@raspberrypi:/tmp $ 

Notes :

  • I do not check whether all possible values of an unsigned int are already in use when you enter a new string; you will run out of memory before that happens. Use a 64-bit unsigned type to be sure ;-)

  • The ID of a string is guaranteed to be unique but depends on the history of insertions and removals; a hash does not depend on history, but several strings may have the same hash.
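To make the contrast in the last note concrete, here is a minimal sketch of a history-independent hash, using 64-bit FNV-1a. The choice of FNV-1a and the function name fnv1a64 are assumptions made for illustration, not something proposed in the answer. The same path always yields the same value, on any machine and in any run, but two different paths could in principle yield the same value.

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a over the bytes of the string. Deterministic, but not
// guaranteed collision-free: distinct strings may share a hash.
std::uint64_t fnv1a64(const std::string& s) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (unsigned char c : s) {
    h ^= c;                 // mix in the next byte
    h *= 1099511628211ULL;  // multiply by the 64-bit FNV prime
  }
  return h;
}
```

A call such as fnv1a64("D:\\Images\\Generic.png") needs no stored state, whereas the UI class above must keep its map alive (or persist it) for the IDs to stay meaningful.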

bruno
  • @DavidC.Rankin I do not understand, what do you mean ? – bruno Jun 05 '19 at 05:43
  • @DavidC.Rankin look at https://stackoverflow.com/questions/15643857/do-i-need-to-allocate-memory-for-a-dirent-structure for instance, or https://stackoverflow.com/questions/54363548/list-of-all-files-inside-the-folder-and-its-subfolders-in-windows – bruno Jun 05 '19 at 05:45
  • 1
    Note that with this approach, the same path-string would correspond to different integer values on different machines (and even on the same machine at different times, e.g. if a file was deleted and then later recreated at the same location). Also you won't be able to get a node-number for a path that doesn't currently exist. – Jeremy Friesner Jun 05 '19 at 05:50
  • @JeremyFriesner yes, this is why I say "_if the corresponding files/dir always exist_" and why I added another proposal ;-) It is not certain the OP wants the same number on different hosts / executions – bruno Jun 05 '19 at 06:09
  • Number can vary on different machines. –  Jun 05 '19 at 07:05