14

I have a file indexing database on Linux. Currently I use file path as an identifier. But if a file is moved/renamed, its path is changed and I cannot match my DB record to the new file and have to delete/recreate the record. Even worse, if a directory is moved/renamed, then I have to delete/recreate records for all files and nested directories.

I would like to use the inode number as a unique file identifier, but an inode number can be reused if a file is deleted and another file is created.

So, I wonder whether I can use a pair of {inode,crtime} as a unique file identifier. I hope to use i_crtime on ext4 and creation_time on NTFS. In my limited testing (with ext4) inode and crtime do, indeed, remain unchanged when renaming or moving files or directories within the same file system.

So, the question is whether there are cases when the inode or crtime of a file may change. For example, can fsck, defragmentation, or partition resizing change the inode or crtime of a file?

Interesting that http://msdn.microsoft.com/en-us/library/aa363788%28VS.85%29.aspx says:

  • "In the NTFS file system, a file keeps the same file ID until it is deleted."
    but also:
  • "In some cases, the file ID for a file can change over time."

So, what are those cases they mentioned?

Note that I studied similar questions:

but they do not answer my question.

jhnlmn
  • The {device_nr, inode_nr} pair is a unique id for a file ("inode") on a system. These are guaranteed to be stable (*maybe* with an exception for NFS). Moving a file does not change the inode; it just moves the link to the inode to another directory. Moving *across* filesystems is different. BTW: the Microsoft documentation mentions NTFS as a possible *exception* to the rule (just like my NFS exception, and possibly NAS/SAN storage?) – wildplasser Apr 17 '13 at 20:53
  • BTW: what is crtime? UFS only has {ctime,atime,mtime} – wildplasser Apr 17 '13 at 21:19
  • crtime is file creation time. It was added in ext4. It was discussed in detail in other posts. – jhnlmn Apr 17 '13 at 21:28
  • For Linux, because inodes can be reused in NFS, and because you might be indexing files on NFS, you can't use only the INODE (we agree on this). So, for Linux, I'd use the inode and the inode generation as a unique key. Let us know what solution you went ahead with (this question has no answer). – Pierre Jan 20 '21 at 16:27

2 Answers

6
  • {device_nr,inode_nr} are a unique identifier for an inode within a system
  • moving a file to a different directory does not change its inode_nr
  • the linux inotify interface enables you to monitor changes to inodes (either files or directories)

Extra notes:

  • moving files across filesystems is handled differently (it is in fact copy + delete)
  • networked filesystems (or a mounted NTFS) cannot always guarantee the stability of inode numbers
  • Microsoft is not a unix vendor, its documentation does not cover Unix or its filesystems, and should be ignored (except for NTFS's internals)

Extra text: the old Unix adage "everything is a file" should in fact be: "everything is an inode". The inode carries all the meta-information about a file (or directory, or a special file) except the name. The filename is in fact only a directory entry that happens to link to the particular inode. Moving a file implies: creating a new link to the same inode, and deleting the old directory entry that linked to it. The inode metadata can be obtained by the stat(), fstat(), and lstat() system calls.
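As a small illustration of the rename behaviour described above, the following Python sketch creates a file in a temporary directory, moves it into a subdirectory, and compares the (st_dev, st_ino) pair before and after. The helper names file_key and demo_rename_keeps_inode are made up for this example, and it assumes the temporary directory sits on a single local filesystem. It says nothing about crtime, which Python's os.stat does not expose on Linux.

```python
import os
import tempfile


def file_key(path):
    """Return the (device, inode) pair for path without following symlinks."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def demo_rename_keeps_inode():
    """Move a file within one filesystem and return its key before and after."""
    with tempfile.TemporaryDirectory() as root:
        old_path = os.path.join(root, "a.txt")
        sub_dir = os.path.join(root, "sub")
        new_path = os.path.join(sub_dir, "b.txt")

        os.mkdir(sub_dir)
        with open(old_path, "w") as f:
            f.write("hello")

        before = file_key(old_path)
        # rename() only rewrites directory entries; the inode is untouched.
        os.rename(old_path, new_path)
        after = file_key(new_path)
        return before, after


before, after = demo_rename_keeps_inode()
print("same inode after move:", before == after)
```

A move across filesystems would not go through a plain rename() at all (it fails with EXDEV and tools fall back to copy + delete), so the key would change in that case.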

wildplasser
  • 2
    wildplasser wrote: "The {device_nr, inode_nr} is a unique id for a file ("inode") on a system. These are guaranteed to be stable". jhnlmn: "Guaranteed" by whom? Do you have a reference to some documentation? Because a quick googling for "guaranteed to be stable" inode brings up "Inode numbers are -not- guaranteed to be stable". Also, the inode alone is not sufficient because inode numbers can be reused. So, I decided to use crtime as well, and I need to understand how stable it is. – jhnlmn Apr 17 '13 at 21:25
  • If the inode number is reused it first has to be unused. (if the link count goes to zero the inode is unused/non-existent, and *then* it can be reused, including its number) See `stat(2)` For hard-core information: see Bach or the source + documentation for the filesystem + drivers. – wildplasser Apr 17 '13 at 21:30
  • 1
    wildplasser wrote: " If the inode number is reused it first has to be unused." jhnlmn: Sure, I wrote "but inode number can be reused if file is deleted and another file created." in my question – jhnlmn Apr 17 '13 at 21:34
  • The inode number is an identifier for an *existing* file on your filesystem. Storing it in a database and then deleting the file is like holding a foreign key in a DBMS without the *CASCADE* option. Your system would still refer to a file that does not exist anymore, by means of a number that now refers to a different entity. The solution is simple: don't delete files, and if you do delete files: remove all the references to them. – wildplasser Apr 17 '13 at 21:40
  • wildplasser wrote: Microsoft is not a unix vendor, its documentation does not cover Unix or its filesystems, and should be ignored jhnlmn: This is not correct. A disk can be plugged in Linux, then in Windows, then back to Linux. Or Windows and Linux may be booted one after another on a dual boot system. So, understanding how MS handles file IDs is equally important. Right now I have more confidence about stability of file IDs in NTFS under Windows than under Linux, which is less documented. – jhnlmn Apr 17 '13 at 21:41
  • wildplasser wrote: The solution is simple: don't delete files, and if you do delete files: remove all the references to it. jhnlmn: And how to do that? How to differentiate between a file being deleted and a file being moved? This was my question from the very beginning. Sorry if I did not make it clear. (PS: and, please, do not tell me to use inotify because it is not scalable). – jhnlmn Apr 17 '13 at 21:44
  • NOTE the selective quoting (*except for NTFS's internals*) BTW: You can still read the source. It's open, you know. – wildplasser Apr 17 '13 at 21:44
  • wildplasser wrote: NOTE the selective quoting (except for NTFS's internals). jhnlmn: I tried to follow editing-help#comment-formatting on http://stackoverflow.com, but nothing works.
    wildplasser wrote: BTW: You can still read the source. It's open, you know.
    jhnlmn: Yes, I know. I have been reading the sources for a while, but I cannot get a definitive answer, since there is always a chance that I will miss one line out of 1 mln. Also, if reading the sources were an answer, then sites like stackoverflow.com would not exist.
    – jhnlmn Apr 17 '13 at 21:50
  • @jhnlmn: please clarify your statements. They appear to be troubled by FUD, IMHO. – wildplasser Mar 04 '14 at 23:11
6

The allocation and management of i-nodes in Unix is dependent upon the filesystem. So, for each filesystem, the answer may vary.

For the Ext3 filesystem (the most popular), i-node numbers are reused, and thus cannot be used as a unique file identifier, nor does reuse occur according to any predictable pattern.

In Ext3, i-nodes are tracked in a bit vector, each bit representing a single i-node number. When an i-node is freed, its bit is set to zero. When a new i-node is needed, the bit vector is searched for the first zero-bit and the i-node number (which may have been previously allocated to another file) is reused.

This may lead to the naive conclusion that the lowest numbered available i-node will be the one reused. However, the Ext3 file system is complex and highly optimised, so no assumptions should be made about when and how i-node numbers can be reused, even though they clearly will.
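The simplified bit-vector picture above can be sketched in a few lines of Python. This is a toy model only: the class name InodeBitmap and its methods are invented for illustration, and it deliberately implements the naive "first zero bit" rule rather than Ext3's real allocator, which also weighs block groups and directory placement. Its only purpose is to show how a freed number comes back for an unrelated file.

```python
class InodeBitmap:
    """Toy first-fit inode allocator (not the real ext3 algorithm)."""

    def __init__(self, size):
        # One flag per inode number; False means "free".
        self.bits = [False] * size

    def allocate(self):
        """Claim the first free slot and return its 1-based inode number."""
        for index, used in enumerate(self.bits):
            if not used:
                self.bits[index] = True
                return index + 1
        raise OSError("no free inodes")

    def free(self, ino):
        """Release an inode number so it can be handed out again."""
        self.bits[ino - 1] = False


bitmap = InodeBitmap(8)
first = bitmap.allocate()
second = bitmap.allocate()
third = bitmap.allocate()

bitmap.free(second)           # the file using this inode is deleted
reused = bitmap.allocate()    # a brand-new, unrelated file is created

print("numbers handed out:", first, second, third)
print("number given to the new file:", reused)
```

In this model the new file receives the number that the deleted file used to have, which is why an inode number on its own cannot tell the two files apart.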

From the source code for ialloc.c, where i-nodes are allocated:

There are two policies for allocating an inode. If the new inode is a directory, then a forward search is made for a block group with both free space and a low directory-to-inode ratio; if that fails, then of the groups with above-average free space, that group with the fewest directories already is chosen. For other inodes, search forward from the parent directory's block group to find a free inode.

The source code that manages this for Ext3 is called ialloc and the definitive version is here: https://github.com/torvalds/linux/blob/master/fs/ext3/ialloc.c

  • Sure, inodes can be reused, I mentioned this at the beginning. My question was whether a pair of {inode,crtime} can be used as a unique identifier, in particular whether these values can change for an existing file. – jhnlmn Aug 30 '14 at 00:51
  • 1
    The answer depends on whether you want to take the risk that filesystem algorithms will change or not. Currently, i-node numbers do not change until a file is removed. In addition, the kernel does not change the i-node number when a file is truncated using truncate(2). Many programs, however, recreate files when they are modified, changing the i-node number. If you have written all software involved, and know specifically that the i-node itself is not going to be removed, then you can take the risk, at least on the ext3 filesystem. Considering the number of factors, I would be cautious. – Gary Wisniewski Sep 09 '14 at 01:44
  • Do you know how filesystems with dynamic inodes work in relation to reusing inodes? Do ZFS and BTRFS also reuse inodes even while dynamically allocating new inodes? Also, are all these techniques based on bit vectors? What if there are a million files: would this inode bit vector then have a million bits? – CMCDragonkai Jun 17 '17 at 08:00
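Gary Wisniewski's comment above distinguishes between truncating a file in place and programs that recreate a file when saving it. A minimal Python sketch of that difference follows; the function name inode_after_edits is hypothetical, and the "write a temporary file, then rename it over the original" pattern is just one common way editors save files, assumed here for illustration.

```python
import os
import tempfile


def inode_after_edits():
    """Return the inode number originally, after truncate, and after replace."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "doc.txt")
        with open(path, "w") as f:
            f.write("version 1")
        original = os.stat(path).st_ino

        # Truncating in place keeps the same inode.
        os.truncate(path, 0)
        after_truncate = os.stat(path).st_ino

        # Save-by-replace: write a new file, then rename it over the old one.
        # The path now points at the new file's inode.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            f.write("version 2")
        os.replace(tmp_path, path)
        after_replace = os.stat(path).st_ino

        return original, after_truncate, after_replace


original, after_truncate, after_replace = inode_after_edits()
print("truncate kept inode:", original == after_truncate)
print("replace changed inode:", original != after_replace)
```

An index keyed on the inode would therefore treat a file saved this way as a new file, even though its path and, from the user's point of view, its identity are unchanged.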