2

I have a std::wstring fName filename for which I'd like to test if it has a .txt extension. This works:

return ((fName.length() >= 4) && (0 == fName.compare(fName.length() - 4, 4, L".txt")));

but it's case sensitive, which I don't want: I need blah.tXt and hello.TXT to be both accepted.


This should work as a case-insensitive version:

std::wstring ext = L".txt";
std::wstring::const_iterator it = std::search(fName.end() - 4, fName.end(), ext.begin(), ext.end(), 
                               [](wchar_t ch1, wchar_t ch2) { return tolower(ch1) == ch2; }); 
                    // no need for tolower(ch2) because the pattern .txt is already lowercase
return (it != fName.end());

but std::search is probably far from optimal because it looks for the pattern anywhere in the given range, whereas here I only need a character-by-character comparison.


As I need to run this test on millions of filenames, how can I improve the performance of a case-insensitive check for the .txt extension?

I don't want the easy solution :

  • lowercase fName into a new variable (or even lowercase just the last 4 characters of fName)

  • then compare

because this would require new variables, memory, etc. Can I compare in place, with a custom predicate such as [](wchar_t ch1, wchar_t ch2) { return tolower(ch1) == ch2; }?


Note: I'm not looking for Boost solutions, nor for solutions like "Case insensitive string comparison in C++" or the many similar questions, which are not optimized for performance.

Basj
  • Why not use the first variant, but instead of `std::wstring::compare` use `std::equal` with your custom predicate? – fghj Jul 20 '17 at 00:18
  • `return ((fName.length() >= 4) && (0 == fName.compare(fName.length() - 4, 4, L".txt")));` What if the extension has more than 3 characters? Your OS should already have functions that take a file name and give you the appropriate parts of the name (e.g. the Windows `Pathxxx` functions). No need to get tripped up with corner cases. – PaulMcKenzie Jul 20 '17 at 01:25
  • 1
    @user1034749 Do you mean `std::equal(fName.end() - ext.length(), fName.end(), ext.begin(), [](wchar_t ch1, wchar_t ch2) { return tolower(ch1) == ch2; })` ? This seems to be the solution indeed! – Basj Jul 20 '17 at 08:11

4 Answers

0

How about this?

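#include <string>
#include <algorithm>
#include <cctype>

// ext must already be lower case.
template<typename CharT>
bool HasExtension(const std::basic_string<CharT>& fileName, const std::basic_string<CharT>& ext)
{
    if (fileName.length() < ext.length())
    {
        return false; // file name is too short to end with ext
    }

    auto b = fileName.begin() + fileName.length() - ext.length();
    auto a = ext.begin();

    // Compare the tail of fileName with ext, lowercasing only the file name side.
    while (b != fileName.end())
    {
        if (*a++ != tolower(*b++))
        {
             return false;
        }
    }
    return true;
}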


int  main()
{
    std::string ext{".Txt"}; // make sure this is a lower case std::string.
    std::transform(ext.begin(), ext.end(), ext.begin(), tolower);  

    std::string fn{"test.txt"};

   return HasExtension(fn, ext) ? 0 : 1;
}
Michaël Roy
0

A suggested solution would be

#include <iostream>
#include <string>

bool isTXT(const std::wstring& str)
{
    const std::wstring::size_type idx = str.rfind(L'.');
    if( idx != std::wstring::npos ){
        std::wstring ext = str.substr(idx+1);
        if( ext == L"txt" || ext == L"TXT" ) // do all possible combinations.
            return true;
    }
    return false;
}

int main()
{
    std::wstring fileName = L"haihs.TXT";
    std::wcout << isTXT(fileName) << std::endl;

    return 0;
}

For the conditional statement ext == L"txt" || ext == L"TXT", you can add the remaining case combinations if you don't want to create a wstring converted to lower or upper case.
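Spelling out every combination means eight comparisons for a three-letter extension. As a rough sketch of an alternative, each character can be tested against its two allowed spellings in place, without a substring. The helper names `matchesEitherCase` and `endsWithTxt` are made up for this illustration, and unlike the rfind version above it only looks at the last four characters:

```cpp
#include <string>

// Hypothetical helper (not part of the answer above): true if `c` is either
// the given lowercase letter or its uppercase counterpart.
inline bool matchesEitherCase(wchar_t c, wchar_t lower, wchar_t upper)
{
    return c == lower || c == upper;
}

// Checks for a ".txt" suffix in any letter case without creating a substring.
bool endsWithTxt(const std::wstring& name)
{
    const std::wstring::size_type n = name.length();
    if (n < 4) {
        return false; // too short to hold ".txt"
    }
    return name[n - 4] == L'.'
        && matchesEitherCase(name[n - 3], L't', L'T')
        && matchesEitherCase(name[n - 2], L'x', L'X')
        && matchesEitherCase(name[n - 1], L't', L'T');
}
```

With this, `endsWithTxt(L"haihs.TXT")` and `endsWithTxt(L"blah.tXt")` are both accepted, and no temporary string is allocated.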

CroCo
-1

As suggested in @fghj's comment, this is a nice solution:

std::equal(fName.end() - ext.length(), fName.end(), ext.begin(),
           [](wchar_t ch1, wchar_t ch2) { return tolower(ch1) == ch2; });
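On its own, that call forms an iterator before fName.begin() when the file name is shorter than the extension, which is undefined behavior. A minimal sketch of a guarded wrapper follows; the function name `hasExtension` and the parameter name `lowerExt` are assumptions for this example, the extension is assumed to be passed in lowercase already, and `std::towlower` is swapped in for `tolower` because the characters are `wchar_t`:

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Hypothetical wrapper around the std::equal call above.
// Assumes `lowerExt` is already lowercase, e.g. L".txt".
bool hasExtension(const std::wstring& fName, const std::wstring& lowerExt)
{
    if (fName.length() < lowerExt.length()) {
        return false; // avoids forming an iterator before fName.begin()
    }
    // Start of the last lowerExt.length() characters of the file name.
    const auto tail = fName.end()
        - static_cast<std::wstring::difference_type>(lowerExt.length());
    // Compare in place: only the file name side is lowercased, nothing is copied.
    return std::equal(tail, fName.end(), lowerExt.begin(),
                      [](wchar_t nameCh, wchar_t extCh) {
                          return std::towlower(static_cast<std::wint_t>(nameCh))
                                 == static_cast<std::wint_t>(extCh);
                      });
}
```

Called as `hasExtension(fName, L".txt")`, this accepts blah.tXt and hello.TXT and rejects names shorter than four characters.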
Basj
-2

If you want an implementation without assumptions (that also doesn't assume the length of the extension, but assumes that the file has a name at least 4 chars in size):

const wchar_t * testing = &fName[fName.length() - 4]; // fName is a std::wstring, so its elements are wchar_t
unsigned int index = 1;
unsigned int total = 0;
while(index < 4) {
    total += testing[index] << index;
    ++index;
}
return total == ('t' << 1) + ('x' << 2) + ('t' << 3) || total == ('T' << 1) + ('X' << 2) + ('T' << 3);

This is quite optimal, but assumes that the sum of the ASCII values of other extensions won't match the sum of the ASCII values of the .txt extension (I also assumed that the extension has 3 chars, like you did above):

int index = fName.length();
// Use explicit offsets: several --index in one expression are unsequenced.
int total = fName[index - 1] + fName[index - 2] + fName[index - 3];
return total == 't' + 'x' + 't' || total == 'T' + 'X' + 'T';

This is a messier version of what is above, but should be faster:

return *((int*)&fName[index - 4]) == '.' + 't' + 'x' + 't';

You can optimize this even further if you know that none of the other extensions will end with a "t", have an "x" in the middle, etc. by doing something like this:

return fName[fName.length() - 1] == 't' || fName[fName.length() - 1] == 'T';
Cpp plus 1
  • Both of these approaches lead to undefined behavior. Also, does this handle case insensitivity? – templatetypedef Jul 20 '17 at 00:47
  • 1
    `assumes that the sum of the ASCII values of other extensions won't match the sum of the ASCII values of .txt`: I don't think this is a valid assumption: `.txt` has the same ASCII char sum as `.tys`... – Basj Jul 20 '17 at 07:42
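
The collision mentioned in the last comment can be checked at compile time. The sketch below assumes an ASCII-compatible execution character set; the helper names `charSum` and `weightedSum` are invented for this illustration. It also shows that the shift-weighted sum from the first snippet of the answer is not collision-free either:

```cpp
// Plain sum of the three extension characters (hypothetical helper name).
constexpr int charSum(char a, char b, char c)
{
    return a + b + c;
}

// Shift-weighted sum in the style of the answer's first snippet.
constexpr int weightedSum(char a, char b, char c)
{
    return (a << 1) + (b << 2) + (c << 3);
}

// ASCII: 116 + 120 + 116 == 116 + 121 + 115 == 352,
// so ".tys" would be mistaken for ".txt".
static_assert(charSum('t', 'x', 't') == charSum('t', 'y', 's'),
              "plain sums collide");

// ASCII: 232 + 480 + 928 == 236 + 476 + 928 == 1640,
// so ".vwt" collides with ".txt" under the weighted sum.
static_assert(weightedSum('t', 'x', 't') == weightedSum('v', 'w', 't'),
              "weighted sums collide");
```

Neither sum accepts mixed-case spellings such as .tXt, which is another reason to prefer a per-character comparison.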