9

I have a directory listing for which I want to retrieve the filenames and put them in a vector of strings such that they are sorted in a "natural" manner. e.g. { "10.txt" "0.txt" "2.txt" "1.m" "Jan12" "July13.txt" "Nov25.txt" "Jane" "John" } should be {"0.txt" "1.m" "2.txt" "10.txt" "Jan12" "July13.txt" "Nov25.txt" "Jane" "John" } . What is the easiest way to do this?

Elaborating on "natural" we assume for a string made up from parts of numbers (N) and text (T) such that ...(N)(T)... , then for ...(N1)(T1)... and ...(N2)(T2)... will be (N1<N2) (<) (T1<T2) where (<) implies left term precedence over right term. In this case, numbers take precedence over text fields if they are in the same position in the string, i.e. 1.z (<) 1_t.txt .

Is there already a library function to do that kind of sort for alphanumeric strings, or directory entries?

DESIRED order in which files should come. The file names will be stored in a vector of strings.

Abhinav@Abhinav-PC /cygdrive/c/AbhinavSamples/shell
$ ls -lv
total 8
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 1.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:55 1_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:50 3.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 4.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 10.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 10_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 13.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 20.txt

**Simple Sort**
Abhi@Abhi-PC /cygdrive/c/AbhinavSamples/shell
$ ls -l
total 8
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 1.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 10.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 10_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 13.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:55 1_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 20.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:50 3.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 4.txt
aschepler
  • 70,891
  • 9
  • 107
  • 161
Abhinav
  • 1,496
  • 3
  • 15
  • 31
  • http://www.cplusplus.com/reference/clibrary/cstdlib/qsort/ , but i'm not sure if that will do what you want. Why not write a simple sort function? That's the fun of using C/C++ :P, they don't have libraries that do anything you can think of. – bbedward Mar 16 '12 at 19:42
  • 1
    do u want it using any unix script or other programming lang? – Teja Mar 16 '12 at 19:48
  • 1
    bash script? C? C++? perl? python? – karlphillip Mar 16 '12 at 19:59
  • writing own /copying from google means attracting people to doubt your code. Standard stuff is better and you can rely the api thereby you are relieved from writing and executing the test cases and blah b;ah @karlphillip, Venk definitely STL CPP – Abhinav Mar 16 '12 at 20:18
  • Then you should edit your question and add the appropriate tags, like C++. – karlphillip Mar 16 '12 at 20:27
  • Do you mean a natural merge sort? What algorithm are you looking for here? – Nicol Bolas Mar 16 '12 at 20:44
  • @bbedward If you use C++, you should go into std::sort instead of qsort - there is a performance difference, despite the two use the same algorithm. – Rafał Rawicki Mar 16 '12 at 20:53
  • see e.g. http://stackoverflow.com/questions/4622516/sorting-stdstrings-with-numbers-in-them – Philipp Mar 16 '12 at 21:16
  • possible duplicate of http://stackoverflow.com/questions/642213/how-to-implement-a-natural-sort-algorithm-in-c – Philipp Mar 16 '12 at 21:24
  • Do the filenames always begin with a number part? Or can filenames look arbitrary e.g. 'abc123_567zxa.txt'? – Christian Ammer Mar 16 '12 at 22:21
  • the filename may be either way but they have a format a file name will follow a patern like stringNum.ext here string will be same for all the files, Num will be numbers. An example of which may be like Jan1.txt, Jan2.txt, ... Jan10.txt, Jan12.txt and so on. – Abhinav Mar 20 '12 at 07:12
  • In the first example "10.txt" is not in the results, and "1.m" is in there twice. I still can't figure out the intended order though. Also, the description of `(<)` is backwards. – Mooing Duck Mar 21 '12 at 21:38
  • @MooingDuck instead of complaining you can edit the obvious I believe! –  Mar 21 '12 at 23:10
  • the term "natural sort" is ill posed, e.g. should "2_t" come after "1z.txt" and beofre "11" or after "11" ? –  Mar 21 '12 at 23:22

4 Answers4

13

You need a function which makes the natural comparison between two strings. After that you can use std::sort with the comparison function as third argument (as @chac already pointed out). In the following I tried to implement such a function in a recursive way. Note, that it can handle arbitrary filenames which don't need to begin with a number part and end with a string part:

bool compareNat(const std::string& a, const std::string& b)
{
    if (a.empty())
        return true;
    if (b.empty())
        return false;
    if (std::isdigit(a[0]) && !std::isdigit(b[0]))
        return true;
    if (!std::isdigit(a[0]) && std::isdigit(b[0]))
        return false;
    if (!std::isdigit(a[0]) && !std::isdigit(b[0]))
    {
        if (std::toupper(a[0]) == std::toupper(b[0]))
            return compareNat(a.substr(1), b.substr(1));
        return (std::toupper(a[0]) < std::toupper(b[0]));
    }

    // Both strings begin with digit --> parse both numbers
    std::istringstream issa(a);
    std::istringstream issb(b);
    int ia, ib;
    issa >> ia;
    issb >> ib;
    if (ia != ib)
        return ia < ib;

    // Numbers are the same --> remove numbers and recurse
    std::string anew, bnew;
    std::getline(issa, anew);
    std::getline(issb, bnew);
    return (compareNat(anew, bnew));
}

Here's a simple test case:

#include <iostream> // std::cout
#include <string>
#include <algorithm> // std::sort, std::copy
#include <iterator> // std::ostream_iterator
#include <sstream> // std::istringstream
#include <vector>
#include <cctype> // std::isdigit

int main()
{
    std::vector<std::string> str;
    str.push_back("20.txt");
    str.push_back("10.txt");
    str.push_back("1.txt");
    str.push_back("z2.txt");
    str.push_back("z10.txt");
    str.push_back("z100.txt");
    str.push_back("1_t.txt");

    std::sort(str.begin(), str.end(), compareNat);
    std::copy(str.begin(), str.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

The result is:

1.txt
1_t.txt
10.txt
20.txt
z2.txt
z10.txt
z100.txt
Christian Ammer
  • 7,464
  • 6
  • 51
  • 108
  • I would have used `atoi` instead of a stream, but +1 for getting the full algorithm right. – Mooing Duck Mar 21 '12 at 21:48
  • It produces unexpected results in some cases. For example, for strings of the order: `bcd`, `Baa`, `bbc`, it sorts to: `bbc bcd Baa`. And also in some cases it puts `bcd` before `bbc`. – Jahid Nov 03 '15 at 09:14
  • I solved this issue be replacing the `std::toupper()` function with a user defined `toUpper()` which takes a `std::string` and returns the uppercase of the whole string. – Jahid Nov 03 '15 at 09:17
  • @Jahid: I've corrected the comparison function. Now it makes a case insensitive comparison and works as expected. – Christian Ammer Nov 15 '15 at 20:49
  • For `a`, `A` it sorts to `A`, `a` and for `A`, `a` it sorts to `a`, `A` i.e it reverses the order for uppercase and lowercse version of same word... – Jahid Nov 15 '15 at 22:19
  • Because it's a *case insensitive* sort, it doesn't matter if `a` bevore `A` or vice versa, because in the sorting function `a` equals `A`. If you would preserve the order of equal elements from the input, then you have to use `std::stable_sort`. – Christian Ammer Nov 16 '15 at 10:11
  • In a test run it failed with `segmentation fault`, but when I changed the `int` (`int ia,ib`) to `long long` it worked. This was unexpected and I was looking for another solution and found [this one](http://stackoverflow.com/a/33879726/3744681) which works pretty well. – Jahid Nov 23 '15 at 20:38
10

There is a function that does exactly what you want in glibc. Unfortunately it is C, not C++, so if you can live with that here is the simplest possible solution "out of the box", without reimplementing anything and reinventing the wheel. BTW: this is exactly as ls -lv is implemented. The most important part of it is the versionsort function which does the natural sort for you. It is used here as a comparison function for scandir. The simple example below prints all files/directories in current directory sorted as you wish.

#define _GNU_SOURCE
#include <dirent.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    struct dirent **namelist;
    int n,i;

    n = scandir(".", &namelist, 0, versionsort);
    if (n < 0)
        perror("scandir");
    else
    {
        for(i =0 ; i < n; ++i)
        {
            printf("%s\n", namelist[i]->d_name);
            free(namelist[i]);
        }
        free(namelist);
    }
    return 0;
}
sirgeorge
  • 6,331
  • 1
  • 28
  • 33
  • 1
    +1 and that vector will be populated from something like this; bonus: this gives the vector already sorted. no more sorts needed!! – Abhinav Mar 21 '12 at 08:15
3

you can use std::sort, splitting your filenames in number + string (both optional).

#include <vector>
#include <string>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <cstring>
using namespace std;

bool compare_filenames(string a, string b)
{
    char *pA, *pB;
    long A = strtol(a.c_str(), &pA, 10),
         B = strtol(b.c_str(), &pB, 10);
    if (A < B)
        return true;
    if (A == B)
        return strcmp(pA, pB);
    return false;
}

int main_compare_filenames(int, char **)
{
    const char *v[] ={
        "1.txt",
        "10.txt",
        "10_t.txt",
        "13.txt",
        "1_t.txt",
        "20.txt",
        "3.txt",
        "4.txt"
    };
    vector<string> t(v, v + 8);
    sort(t.begin(), t.end(), compare_filenames);
    copy(t.begin(), t.end(), ostream_iterator<string>(cout, "\n"));
    return 0;
}

output:

1_t.txt
1.txt
3.txt
4.txt
10_t.txt
10.txt
13.txt
20.txt

That almost works, but there is the problem that the '_' precedes '.', thus a further tweak is needed:

string remove_dot(const char *p)
{
    const char *dot = strchr(p, '.');
    return dot ? string(p, dot - p) : string(p);
}

bool compare_filenames(string a, string b)
{
    char *pA, *pB;
    long A = strtol(a.c_str(), &pA, 10),
         B = strtol(b.c_str(), &pB, 10);
    if (A < B)
        return true;
    if (A == B)
        return remove_dot(pA) < remove_dot(pB);
    return false;
}

output:

1.txt
1_t.txt
3.txt
4.txt
10.txt
10_t.txt
13.txt
20.txt
CapelliC
  • 59,646
  • 5
  • 47
  • 90
1

I have come across an algorithm which works pretty nicely: http://sourcefrog.net/projects/natsort/

I have modified the source a little to meet my needs:

  1. pass std::string as parameter.
  2. use with std::sort and std::vector.

My version of source is here.

A use case example:

#include <iostream>
#include <vector>
#include <algorithm>
#include "strnatcmp.hpp"

int main(){
    std::vector<std::string> files;

    files.push_back("20.txt");
    files.push_back("10.txt");
    files.push_back("1.txt");
    files.push_back("z2.txt");
    files.push_back("z10.txt");
    files.push_back("z100.txt");
    files.push_back("1_t.txt ");
    files.push_back("ABc");
    files.push_back("aBCd");
    files.push_back("aBc");
    files.push_back("aaa");
    files.push_back("aBcd");
    files.push_back("aaA");

    std::sort(files.begin(),files.end(),compareNat);

    for(int i=0;i<(int)files.size();i++)std::cout<< files[i]+"\n";

    return 0;
}

Output:

1.txt
1_t.txt 
10.txt
20.txt
aaa
aaA
ABc
aBc
aBCd
aBcd
z2.txt
z10.txt
z100.txt

This is the strnatcmp.hpp header.

Jahid
  • 21,542
  • 10
  • 90
  • 108