0

I need to be able to uniquely identify a file.

The file would be uploaded to a web application and could be uploaded from anywhere. The file could be mailed to someone, renamed, edited and then uploaded from a different machine altogether, and I need a way of identifying that this was the file that was originally uploaded. I need a reliable way of identifying a file across different file systems.

  • I cannot use the filename as the identifier as the file could be renamed. I still need to be able to uniquely identify the file even though it was renamed.
  • I cannot use a hash of the file, as the hash would change if the file was edited.
  • I understand Linux has the inode number and Windows has the IndexNumber. I can call NtQueryInformationFile to get the IndexNumber. It stayed the same when the file was edited and when it was renamed, but it was different when the file was moved from one folder to another.
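
The hash behaviour described in the list above can be checked with a minimal sketch. The file names and the helper names `sha256_of` and `hash_demo` are made up for illustration; only the standard library is used.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of the file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def hash_demo() -> tuple[str, str, str]:
    """Hash a file, then again after a rename, then again after an edit."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "report.txt")  # hypothetical file name
        with open(original, "wb") as fh:
            fh.write(b"first draft\n")
        before = sha256_of(original)

        # Renaming does not touch the bytes, so the digest is unchanged.
        renamed = os.path.join(tmp, "report-final.txt")
        os.rename(original, renamed)
        after_rename = sha256_of(renamed)

        # Any edit to the contents produces a different digest.
        with open(renamed, "ab") as fh:
            fh.write(b"one more line\n")
        after_edit = sha256_of(renamed)
    return before, after_rename, after_edit
```

So a content hash survives a rename but not an edit, which is why it only identifies exact copies.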

From all the reading I have done, it seems the IndexNumber is not reliable for all documents. I almost have a feeling that there is no unique identifier for a file that stays constant across different folders and machines and remains unchanged when the file is edited, renamed, etc. But here I am, StackOverflow. Any help is appreciated.

Edit: Here is the business problem I am trying to solve. A user uploads a file to our web application, then enters a bunch of metadata for the file, similar to adding tags. We keep the file in blob storage, but the user still has a local copy, which they mail to another user, perhaps editing and renaming it first. When the other user uploads the file to our web application, is there any way we can identify that this was the original file, so as to pre-populate the metadata the original user entered?

Alvin Saldanha
  • 4
    Can you properly explain the problem you're trying to solve? A file is a virtual construct. What we call a file is just a bunch of bytes. If some bytes get modified, it may still have the same filename, but it's modified. If its contents are bit-by-bit copied to another file, it's the same contents with a different name. If it was moved to external storage or backed up to disk or tape, and then restored to the original file (or another name), it's still the same bytes. You're kind of asking about https://en.wikipedia.org/wiki/Ship_of_Theseus. Are you looking for signing files, proving the src? – CodeCaster Sep 09 '20 at 07:44
  • I have edited the question and added a little bit of context, the business problem I am trying to solve if it helps. – Alvin Saldanha Sep 09 '20 at 07:54
  • 1
    @CodeCaster had a blast reading through that wikipedia page btw. 'In ontological engineering such as the design of practical databases and AI systems, the paradox appears regularly when data objects change over time.' Groeten uit overijssel! – sommmen Sep 09 '20 at 07:59
  • So you want to build something like DropBox, Google Docs or SharePoint. All these work because *they* manage the storage location, and keep track of the metadata themselves. Or they use some type of object storage service that does the same thing for them. – Panagiotis Kanavos Sep 09 '20 at 07:59
  • 3
    Still too vague. If a user copies a file and modifies it, it's not the same file anymore. You could use a file format that has an internal identifier to uniquely identify the file, but that could be modified as well. You could sign the file somehow (and keep the signature with or within the file), but then after modifying the signature would differ. You could compare the files and consider a percentage-match a threshold for equality (compare text, pixel colors, bytes). – CodeCaster Sep 09 '20 at 08:00
  • 1
    Specific file formats, like Word, Excel etc contain an identifier you could use to track a document, even if the document is edited. Other formats allow embedding metadata in the file as well. Those formats were created with that ability on purpose though - tracking office documents over email. The trick is of course, what happens when you have *multiple* modified versions? Office documents support versioning for this, but only up to a point – Panagiotis Kanavos Sep 09 '20 at 08:02
  • @PanagiotisKanavos yeah and then someone creates a Word template and sends that around, everyone clears its contents and starts writing new text, are they all the same file? Or the other way around, someone creates two new files and pastes the same contents in it. Different bytes, different identifier, same file? The OP should still be more explicit about the file format and its contents, and why they would consider different files the same. – CodeCaster Sep 09 '20 at 08:04
  • 1
    Frankly, there's a reason only Groove ever did what you ask, by controlling storage as well. It's **very hard** to track documents and changes across machines unless you have real-time communication with a server that can keep track of it. Google Docs, Word Online etc solve this problem by controlling storage *and* the editors. And Office applications can talk to Office Online servers directly – Panagiotis Kanavos Sep 09 '20 at 08:04
  • @CodeCaster The metadata that we are tracking is basically tags for the document or categories that the document falls under. So a Word template copied over could possibly have the same tags. – Alvin Saldanha Sep 09 '20 at 08:07
  • @PanagiotisKanavos Could you point me to the identifier for the office documents that can be tracked? I was able to extract the indexNumber, but per this comment it is regenerated every time the office document is edited and saved. https://stackoverflow.com/questions/1866454/unique-file-identifier-in-windows/1866788#1866788 – Alvin Saldanha Sep 09 '20 at 08:08
  • @CodeCaster Groove handled this - after all, a template isn't the *document* produced by it. The overall problem is *really hard* though. There's a reason Ray Ozzie is considered a genius – Panagiotis Kanavos Sep 09 '20 at 08:08
  • @Panagiotis I didn't really mean Word templates (.dotx), as I very rarely see those out in the wild, but a readonly DocX that gets shared within a department and cleared/modified every time it's needed to base a new document upon. – CodeCaster Sep 09 '20 at 08:11
  • 2
    There is no way to guarantee you can do this, period. You can embed a unique id into the file itself, but that id can be edited and you're back to square 1. Alternate data streams (NTFS), inode (linux) and all similar schemes disappear when you're talking about "uploaded to a web application". You only have 2 things left then, the name of the file (which can be edited) and the content of the file (which can be edited). Nothing else. Let me spell it out for you: **It's not possible to do what you want.** – Lasse V. Karlsen Sep 09 '20 at 08:12
  • @Alvin _"So a Word template copied over could possibly have the same tags."_ - my point is still that the origin of a document tells you nothing about its contents. Someone could have copied a document about subject A because they like its styling, but cleared all contents and wrote something entirely different, subject B. What you're trying to solve is a very hard problem. What you really seem to be looking for is partial text matches, and _that_ is a solved problem. Please [edit] your question to very explicitly mention what you need and what you have tried. – CodeCaster Sep 09 '20 at 08:13
  • 1
    @AlvinSaldanha [that would be docI](https://learn.microsoft.com/en-us/openspecs/office_standards/ms-docx/b5058d55-0aa8-44e0-9a37-0c84b6e9f68b) you're asking the wrong questions. You're trying to build something that was eventually abandoned [even by the company that acquired Groove, Microsoft](https://en.wikipedia.org/wiki/Groove_Networks). That technology was moved to OneDrive and Office Online. Requiring a client on every machine to handle communication with a central server doesn't scale very well – Panagiotis Kanavos Sep 09 '20 at 08:13
  • 2
    Once you understand that it is not possible to guarantee that you can uniquely identify a file that can change across machines, you can start looking at acceptable alternatives. – Lasse V. Karlsen Sep 09 '20 at 08:13
  • @LasseV.Karlsen I was not very optimistic when I asked the question myself. Like I thought, doesn't look like there is a way to uniquely identify a file. Thanks for confirming my doubts. – Alvin Saldanha Sep 09 '20 at 08:16
  • DRM/IRM services allow tracking of documents but again, this requires cooperation from the client applications. It works with Office documents because Office applications support this, just as they support tracking, real-time collaboration etc. They were *explicitly built for this*. – Panagiotis Kanavos Sep 09 '20 at 08:16
  • @AlvinSaldanha oh, there *are* ways, once you accept certain constraints. This has been solved already with Groove. It's even available as a cloud service with Office Online, with specific components available as separate services, like Azure RMS for rights management. Rebuilding the same is not as simple or as cheap as you expected. If you want that functionality, you can buy it. Building it would be really, really hard though – Panagiotis Kanavos Sep 09 '20 at 08:19
  • @PanagiotisKanavos I have no control over the clients, nor do I have any flexibility to install anything on the client. I am assuming the solution you are talking about involves something on the client as well, right? Though it is an option in general, it's not an option for us. – Alvin Saldanha Sep 09 '20 at 08:22

2 Answers

3

The simple answer to your question is that it can't be done.

Let me summarize what you're trying to accomplish.

If I have a file on my office computer, and upload that to your web application, you want to store that into your system as a new file. Then, if I copy the file from my office computer to my home computer, edit the file contents, rename the file, and then upload it into your web application, you want to identify that this is the same file as the one I previously uploaded.

It can't be done.

Not with a 100% guarantee that you can identify this.

When you are uploading files to a web application, what is sent is this:

  • The name of the file
  • The length of the file
  • The contents of the file

Things such as alternate data streams (NTFS), from the other answer here, or inode or similar identifiers, from the comments, are not sent. Your web application will not see them. Nor would these things be "across multiple computers".
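
To illustrate why such identifiers cannot help, here is a minimal sketch, assuming Python's `os.stat` as the way to read the local identifier (`st_ino` is the inode number on Linux and the file index on Windows). The helper names `local_identity` and `copy_demo` and the file names are made up for this example. A byte-for-byte copy already gets a different identifier on the same machine, and none of this travels with an upload.

```python
import os
import shutil
import tempfile


def local_identity(path: str) -> tuple[int, int]:
    """(device, inode / file index) as reported by the local file system."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


def copy_demo() -> dict:
    """Copy a file and compare its bytes and its local identifiers."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "a.txt")  # hypothetical file names
        dst = os.path.join(tmp, "copy-of-a.txt")
        with open(src, "wb") as fh:
            fh.write(b"same bytes\n")
        shutil.copyfile(src, dst)

        with open(src, "rb") as f1, open(dst, "rb") as f2:
            same_bytes = f1.read() == f2.read()

        # The identifier belongs to the directory entry on this volume,
        # not to the content, so the copy gets its own.
        return {
            "same_bytes": same_bytes,
            "src_id": local_identity(src),
            "dst_id": local_identity(dst),
        }
```

The server only ever receives the equivalent of the copy's bytes plus a file name, so it has nothing like `src_id` to compare against.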

So bottom line, this is impossible.

Your options are:

  1. Let the user uniquely pick the file they want to overwrite, meaning that the user could pick unrelated files and thus be "wrong"
  2. Work out a reasonable chance that you identified the right file, accepting the chance that you identified incorrectly
  3. Embed a unique id into the file itself, however since the file contents can be edited (and the id can be changed) this is not guaranteed
  4. ... other options that don't have a 100% guarantee of being right

The first option is of course the easiest.

The second option could use a technique like the one git uses to track renames, but even this will fail depending on how much the file was edited between uploads. Git fails in this respect too, except that "failure" there simply means it doesn't show you the full history of a file; it doesn't break down and become unusable.
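
A rough sketch of that second option, assuming text-like files and using Python's `difflib` rather than git's actual algorithm. The function names, the word-level tokenisation and the `0.6` threshold are arbitrary choices for illustration, not recommendations.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: bytes, b: bytes) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical token sequences."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def best_match(
    upload: bytes,
    stored: dict[str, bytes],
    threshold: float = 0.6,  # hypothetical cut-off, tune for your data
) -> Optional[tuple[str, float]]:
    """Return (id, score) of the most similar stored file, or None if no
    stored file reaches the threshold. The file name is never consulted."""
    best: Optional[tuple[str, float]] = None
    for file_id, content in stored.items():
        score = similarity(upload, content)
        if best is None or score > best[1]:
            best = (file_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

A lightly edited upload scores close to 1.0 against its original, while a heavy rewrite falls below the threshold and is treated as a new file. That is exactly the "accepting the chance that you identified incorrectly" trade-off.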

The third option might work if the file should be edited by a program similar to Word or Excel or Photoshop, etc. You could embed the ID and just make sure that program doesn't change it. It would probably have a higher and acceptable chance of being right, but it might still be possible to edit.
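
As a minimal sketch of the third option for a plain-text file: the `tracking-id:` marker line and the function names below are invented for this example. Real formats such as Office documents have their own metadata slots, which this does not model.

```python
import re
import uuid
from typing import Optional

# Hypothetical marker format, made up for this sketch.
_MARKER = re.compile(rb"^tracking-id: ([0-9a-f]{32})$", re.MULTILINE)


def embed_id(content: bytes) -> tuple[bytes, str]:
    """Append a marker line carrying a fresh random id; return (content, id)."""
    file_id = uuid.uuid4().hex
    # Make sure the marker starts on its own line.
    sep = b"" if not content or content.endswith(b"\n") else b"\n"
    tagged = content + sep + b"tracking-id: " + file_id.encode("ascii") + b"\n"
    return tagged, file_id


def extract_id(content: bytes) -> Optional[str]:
    """Return the embedded id, or None if the marker is missing or mangled."""
    match = _MARKER.search(content)
    return match.group(1).decode("ascii") if match else None
```

The id survives renames and edits to the rest of the file, but anyone (or any program) that rewrites or drops the marker line defeats it, so this raises the hit rate without giving a guarantee.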

So you will have to decide what is acceptable to you, but you cannot create a system that is guaranteed to identify the file once it has been renamed and its contents changed, because at that point you cannot rule out that the user is simply uploading a different file altogether.

Lasse V. Karlsen
  • There is another option - use Office Online as the necessary tracking functionality is embedded in the server and "client" applications. Of course this probably means losing the customer, unless the rest of the application offers significant advantages over Office Online or SharePoint Online alone. It also means a major redesign of the project, from the business layer down – Panagiotis Kanavos Sep 09 '20 at 08:28
-2

On windows with the NTFS file system you could use alternate data streams or NTFS streams; http://ntfs.com/ntfs-multiple.htm

stream: A sequence of bytes written to a file on the target file system. Every file stored on a volume that uses the file system contains at least one stream, which is normally used to store the primary contents of the file. Additional streams within the file can be used to store file attributes, application parameters, or other information specific to that file. Every file has a default data stream, which is unnamed by default. That data stream, and any other data stream associated with a file, can optionally be named.

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-fscc/8ac44452-328c-4d7b-a784-d72afd19bd9f

There is not a lot of official documentation however. But you can inject a GUID there to be able to track the file.

One limitation of this solution is that it only works on the NTFS file system; when the file is copied to, e.g., a FAT file system, the information is lost.

You need to access native Win32 APIs, however. See, for example, this SO answer: https://stackoverflow.com/a/605167/4122889

Or this random blog: https://blogs.msmvps.com/bsonnino/2016/11/24/alternate-data-streams-in-c/

sommmen
  • 'When you copy an NTFS file to a FAT volume, such as a floppy disk, data streams and other attributes not supported by FAT are lost'. This will not work. I need a reliable way of identifying a file across different file systems. – Alvin Saldanha Sep 09 '20 at 07:44
  • Yeah but if you copy the file to a FAT USB stick and then copy the file back, the ADS is gone. – CodeCaster Sep 09 '20 at 07:45
  • Hi guys. Yes - i'm aware of this limitation but for clarity i shall add this to the question - for further readers who may stumble across this question. – sommmen Sep 09 '20 at 07:50
  • Done. Added this to the question. – Alvin Saldanha Sep 09 '20 at 07:55
  • 3
    @AlvinSaldanha what you explicitly asked is simply impossible in all OSs. Apart from ensuring the path is always the same, there's no way to track a file across *machines*. Services like Dropbox, OneDrive, Google Docs etc work by keeping a database internally with a file's unique object ID and its current storage location. – Panagiotis Kanavos Sep 09 '20 at 07:58