7

Detect if there is any non-ASCII character in a file path

I have a UTF-8 encoded string that stores a file path, for instance C:\Users\myUser\Downloads\ü.pdf. I have already checked that the string holds a valid path in the local file system, but since I'm sending it to a different process that supports only ASCII, I need to figure out whether it contains any non-ASCII characters.

How can I do that?

FrankS101
  • 1
    Convert to ASCII, convert back to UTF-8 then compare the original string with the one that has been converted twice. If compare is successful send the ASCII string. – Richard Critten Jan 11 '18 at 17:44

2 Answers

8

An ASCII character uses only the lower 7 bits of a char (values 0-127). A non-ASCII Unicode character encoded in UTF-8 is stored as two to four char elements, all of which have the upper bit set. So you can simply iterate over the char elements and check whether any of them has a value above 127, e.g.:

bool containsOnlyASCII(const std::string& filePath) {
  for (auto c: filePath) {
    if (static_cast<unsigned char>(c) > 127) {
      return false;
    }
  }
  return true;
}

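A note on the cast: std::string contains char elements, and the standard leaves it implementation-defined whether char is signed or unsigned. Where char is signed, bytes with the upper bit set are negative, so comparing c > 127 directly would never be true. Converting to unsigned char is well-defined: the standard specifies that the value is reduced modulo 2^N (N being the number of bits in unsigned char), so those bytes become 128-255.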
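
To make the cast concrete: "ü" (U+00FC) is encoded in UTF-8 as the two bytes 0xC3 0xBC, both above 127. The following is a minimal sketch that reports where the first such byte sits instead of only returning a yes/no answer. The helper name firstNonASCII is hypothetical and not part of the answer above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first byte outside 7-bit
// ASCII, or std::string::npos if every byte is in the range 0-127.
std::size_t firstNonASCII(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast before comparing: where char is signed, bytes 0x80-0xFF are
        // negative, so a plain `s[i] > 127` would never be true.
        if (static_cast<unsigned char>(s[i]) > 127) {
            return i;
        }
    }
    return std::string::npos;
}
```

For the byte sequence "ab\xC3\xBC.pdf" this returns 2, the position of the 0xC3 lead byte. Writing the bytes as explicit escapes keeps the result independent of the source file's encoding.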

Remy Lebeau
Cris Luengo
7

As suggested in several comments and shown in @CrisLuengo's answer, we can iterate over the characters looking for any with the upper bit set:

#include <iostream>
#include <string>
#include <algorithm>

bool isASCII (const std::string& s)
{
    return !std::any_of(s.begin(), s.end(), [](char c) { 
        return static_cast<unsigned char>(c) > 127; 
    });
}

int main()
{
    std::string s1 { "C:\\Users\\myUser\\Downloads\\Hello my friend.pdf" };   
    std::string s2 { "C:\\Users\\myUser\\Downloads\\ü.pdf" };

    std::cout << std::boolalpha << isASCII(s1) << "\n";
    std::cout << std::boolalpha << isASCII(s2) << "\n";
}

true
false

FrankS101
  • 2
    Even though this may not be the solution, that function can be shortened to just `return std::all_of(filepath.begin(), filepath.end(), ::isprint);` – PaulMcKenzie Jan 11 '18 at 17:49
  • 1
    @1201ProgramAlarm https://stackoverflow.com/questions/21805674/do-i-need-to-cast-to-unsigned-char-before-calling-toupper The cast is there to avoid undefined behavior due to a negative value, although in this particular case that will never happen – FrankS101 Jan 11 '18 at 17:58
  • @PaulMcKenzie You're right, that would be shorter, but why might this not be the solution? A counterexample would be helpful. – FrankS101 Jan 11 '18 at 18:01
  • 2
    Be aware that the behavior of `isprint` depends on the current C locale. If someone changes the locale, this will no longer be doing a check for "printable ASCII." At a minimum, I would change the name of the function to avoid confusion. – Adrian McCarthy Jan 11 '18 at 18:23
  • I would probably use something like `bool isASCII = std::all_of(filepath.begin(), filepath.end(), [](char c){ return static_cast<unsigned char>(c) <= 127; });` or `bool isASCII = !std::any_of(filepath.begin(), filepath.end(), [](char c){ return static_cast<unsigned char>(c) > 127; });` – Remy Lebeau Jan 11 '18 at 20:55
  • So NUL, SOH, etc. (`"\x01\02\x03"`) are ASCII characters per your code? – Haseeb Mir Sep 29 '21 at 16:06
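
The last few comments raise two separate points: `isprint` depends on the current C locale, and the plain "<= 127" test also accepts control characters such as NUL and SOH. If the receiving process only tolerates printable ASCII (an assumption, since the question does not say), a locale-independent check could be sketched as below. The name isPrintableASCII is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true only if every byte is a printable ASCII
// character, 0x20 (space) through 0x7E ('~'). Control characters such as
// NUL or SOH, DEL (0x7F), and all bytes above 127 are rejected.
// No locale is consulted, unlike std::isprint.
bool isPrintableASCII(const std::string& s) {
    return std::all_of(s.begin(), s.end(), [](char c) {
        const unsigned char u = static_cast<unsigned char>(c);
        return u >= 0x20 && u <= 0x7E;
    });
}
```

This is stricter than the isASCII function in the answer: a path made only of letters, digits, spaces and punctuation passes, while "\x01\x02\x03" and any UTF-8 multi-byte sequence fail.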