The several top-rated answers are all good and correct in their own ways; I am writing here to add more information, context, and perspective.
For clarity, let us say that string A contains string B if there is any contiguous run of codepoints in A (a substring) which is equal to B. If we accept this, the problem is reduced to the question of whether two strings are equal.
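A minimal sketch of that reduction, in Python. The names `contains` and `equal` are hypothetical, chosen for illustration. The function tries every substring of A rather than a fixed-width window, because two strings that are considered equal need not have the same number of codepoints. It is deliberately naive (quadratic in the length of A) and is meant only to show that "contains" needs nothing beyond a notion of equality.

```python
from typing import Callable


def contains(a: str, b: str, equal: Callable[[str, str], bool]) -> bool:
    """Return True if some contiguous run of codepoints in `a` is `equal` to `b`."""
    n = len(a)
    # Try every substring a[i:j]. The window length is not fixed, because
    # strings that compare equal may differ in their number of codepoints.
    for i in range(n + 1):
        for j in range(i, n + 1):
            if equal(a[i:j], b):
                return True
    return False


def binary_equal(x: str, y: str) -> bool:
    """Strict codepoint-by-codepoint equality."""
    return x == y


def caseless_equal(x: str, y: str) -> bool:
    """One of many possible looser notions of equality."""
    return x.casefold() == y.casefold()
```

With `binary_equal`, "Hello World" does not contain "WORLD"; with `caseless_equal`, it does. Everything that follows is about what to plug in as `equal`.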
The question of when strings are equal has been considered in detail for many decades. Much of the present state of knowledge is encapsulated in SQL collations. Unicode normal forms are close to a proper subset of this. But there is more beyond even SQL collations.
For example, an SQL collation can be:
Strictly binary sensitive - so that different Unicode normalisation forms (e.g. precomposed or combining accents) compare differently. For example, é can be represented as either U+00E9 (precomposed) or U+0065 U+0301 (e followed by a combining acute accent). Are these the same or different?
Unicode normalised - In this case the above examples would be equal to each other, but not to É or e.
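The difference between binary and normalised comparison can be sketched with Python's standard `unicodedata` module. The helper name `nfc_equal` is hypothetical; NFC is used here, but any one normalisation form applied consistently to both sides would serve.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single codepoint
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT


def nfc_equal(x: str, y: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)


print(precomposed == combining)             # False: binary comparison
print(nfc_equal(precomposed, combining))    # True: same character after normalisation
print(nfc_equal(precomposed, "\u00c9"))     # False: É is still a different character
print(nfc_equal(precomposed, "e"))          # False: so is plain e
```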
Accent insensitive (e.g. for Spanish, German or Swedish text). In this case U+0065 = U+0065 U+0301 = U+00E9 = é = e.
Case and accent insensitive (e.g. for Spanish, German or Swedish text). In this case U+00E9 = U+0065 U+0301 = U+00C9 = U+0045 U+0301 = U+0045 = U+0065 = E = e = É = é.
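A rough approximation of these two levels of insensitivity, again using only the Python standard library. The function names are hypothetical. Stripping every combining mark is a blunt instrument: real collations are locale-aware and this sketch is not, so treat it as an illustration of the idea rather than a substitute for a proper collation.

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Decompose to NFD, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_key(s: str) -> str:
    """Key under which strings differing only in accents compare equal."""
    return strip_accents(s)


def case_accent_insensitive_key(s: str) -> str:
    """Key under which strings differing only in accents or case compare equal."""
    return strip_accents(s).casefold()


variants = ["\u00e9", "e\u0301", "\u00c9", "E\u0301", "E", "e"]
print({accent_insensitive_key(v) for v in variants})        # two keys: 'e' and 'E'
print({case_accent_insensitive_key(v) for v in variants})   # one key: 'e'
```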
Kanatype sensitive or insensitive, i.e. you can consider Japanese Hiragana and Katakana as equivalent or different. The two syllabaries contain the same number of characters, organised and pronounced in (mostly) the same way, but written differently and used for different purposes. For example, katakana are used for loan words or foreign names, while hiragana are used for children's books, pronunciation guides (e.g. ruby annotations), and where there is no kanji for a word (or perhaps where the writer does not know the kanji, or thinks the reader may not know it).
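A kana-insensitive comparison can be sketched by folding one syllabary onto the other. In Unicode the main hiragana block (U+3041 to U+3096) sits exactly 0x60 below the corresponding katakana, so a fixed offset is enough for this illustration. The function name is hypothetical, and the sketch ignores less common characters such as iteration marks.

```python
HIRAGANA_START = "\u3041"   # ぁ
HIRAGANA_END = "\u3096"     # ゖ
KANA_OFFSET = 0x60          # distance from a hiragana to its katakana counterpart


def hiragana_to_katakana(s: str) -> str:
    """Map each hiragana codepoint to katakana; leave everything else alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ch <= HIRAGANA_END else ch
        for ch in s
    )


print(hiragana_to_katakana("ひらがな"))   # ヒラガナ
print(hiragana_to_katakana("ひらがな") == hiragana_to_katakana("ヒラガナ"))  # True
```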
Full-width or half-width sensitive - Japanese encodings include two representations of some characters for historical reasons - they were displayed at different sizes.
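Width insensitivity is one of the things Unicode compatibility normalisation (NFKC or NFKD) provides. A small sketch, with a hypothetical helper name; note that NFKC folds many other compatibility distinctions as well, which may or may not be what you want.

```python
import unicodedata


def width_insensitive_key(s: str) -> str:
    """NFKC maps half-width katakana and full-width Latin to their ordinary forms."""
    return unicodedata.normalize("NFKC", s)


half_width_kana = "\uff76\uff80\uff76\uff85"    # ｶﾀｶﾅ written with half-width forms
full_width_latin = "\uff21\uff22\uff23"         # ＡＢＣ written with full-width forms

print(width_insensitive_key(half_width_kana) == "\u30ab\u30bf\u30ab\u30ca")  # True (カタカナ)
print(width_insensitive_key(full_width_latin) == "ABC")                      # True
```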
Ligatures considered equivalent or not (see https://en.wikipedia.org/wiki/Ligature_(writing)): Is æ the same as ae or not? They have different Unicode encodings, as do accented characters, but unlike accented characters they also look different.
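Unicode itself is not consistent here, which a short sketch can show. Typographic ligatures such as ﬁ (U+FB01) have a compatibility decomposition, so NFKC expands them; æ (U+00E6) is encoded as a letter in its own right and has no decomposition, so folding it requires a table of your own. The table and function name below are hypothetical, and whether æ should equal ae at all depends on the language of the text.

```python
import unicodedata

# Hypothetical, deliberately tiny fold table for ligature-like letters
# that Unicode normalisation leaves untouched.
LETTER_LIGATURES = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"})


def fold_ligatures(s: str) -> str:
    """Expand compatibility ligatures via NFKC, then apply the manual table."""
    return unicodedata.normalize("NFKC", s).translate(LETTER_LIGATURES)


print(unicodedata.normalize("NFKC", "\ufb01sh"))   # fish: ﬁ is expanded
print(unicodedata.normalize("NFKC", "æther"))      # æther: æ is left alone
print(fold_ligatures("æther"))                     # aether
```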
Which brings us to...
Arabic presentation form equivalence
Arabic writing has a culture of beautiful calligraphy, where particular sequences of adjacent letters have specific representations. Many of these have been encoded in the Unicode standard. I don't fully understand the rules, but they seem to me to be analogous to ligatures.
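For what it is worth, the encoded presentation forms do carry compatibility decompositions, so NFKC maps them back to the underlying letters. The sketch below shows this for a single codepoint, U+FEFB (ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM); it says nothing about the shaping rules themselves, which are applied by fonts and rendering engines rather than stored in the text.

```python
import unicodedata

lam_alef_ligature = "\ufefb"      # presentation form: lam-alef ligature, isolated
lam_then_alef = "\u0644\u0627"    # the two underlying letters, lam followed by alef

print(lam_alef_ligature == lam_then_alef)                                  # False
print(unicodedata.normalize("NFKC", lam_alef_ligature) == lam_then_alef)   # True
```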
Other scripts and systems: I have no knowledge whatsoever of Kannada, Malayalam, Sinhala, Thai, Gujarati, Tibetan, or almost all of the tens or hundreds of scripts not mentioned. I assume they have similar issues for the programmer, and given the number of issues mentioned so far and for so few scripts, they probably also have additional issues the programmer ought to consider.
That gets us out of the "encoding" weeds.
Now we must enter the "meaning" weeds.
Is Beijing equal to 北京? If not, is Běijīng equal to 北京? If not, why not? It is the Pinyin romanisation.
Is Peking equal to 北京? If not, why not? It is the older postal romanisation (the Wade-Giles form would be Pei-ching).
Is Beijing equal to Peking? If not, why not?
Why are you doing this anyway?
For example, if you want to know if it is possible that two strings (A and B) refer to the same geographical location, or same person, you might want to ask:
Could these strings be either Wade-Giles or Pinyin representations of a set of sequences of Chinese characters? If so, is there any overlap between the corresponding sets?
Could one of these strings be a Cyrillic transcription of a Chinese character?
Could one of these strings be a Cyrillic transliteration of the Pinyin romanisation?
Could one of these strings be a Cyrillic transliteration of a Pinyin romanisation of a Sinification of an English name?
Clearly these are difficult questions, which don't have firm answers, and in any case, the answer may be different according to the purpose of the question.
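The first of those questions, the one about overlapping sets, can at least be sketched. Everything here is hypothetical: the table `CANDIDATES` is a toy stand-in for whatever romanisation data you actually have, and the function names are invented for illustration. The point is only the shape of the comparison: each string maps to a set of things it might mean, and two strings may refer to the same thing if those sets intersect.

```python
# Hypothetical toy table: romanised name -> Chinese strings it might stand for.
# A real table would be far larger and would map one name to many candidates.
CANDIDATES = {
    "beijing": {"北京"},
    "peking": {"北京"},
    "nanjing": {"南京"},
}


def candidates(name: str) -> set:
    """Return the set of strings `name` might stand for (empty if unknown)."""
    return CANDIDATES.get(name.casefold(), set())


def may_refer_to_same_place(a: str, b: str) -> bool:
    """True if the candidate sets of `a` and `b` overlap."""
    return bool(candidates(a) & candidates(b))


print(may_refer_to_same_place("Beijing", "Peking"))    # True
print(may_refer_to_same_place("Beijing", "Nanjing"))   # False
```

Note that an unknown name yields an empty set and therefore `False`, which is really "don't know" rather than "no".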
To finish with a concrete example:
- If you are delivering a letter or parcel, clearly Beijing, Peking, Běijīng and 北京 are all equal. For that purpose, they are all equally good. No doubt the Chinese post offices recognise many other options, such as Pékin in French, Pequim in Portuguese, Bắc Kinh in Vietnamese, and Бээжин in Mongolian.
Words do not have fixed meanings.
Words are tools we use to navigate the world, to accomplish our tasks, and to communicate with other people.
While it looks like it would be helpful if words like equality, Beijing, or meaning had fixed meanings, the sad fact is they do not.
Yet we seem to muddle along somehow.
TL;DR: If you are dealing with questions relating to reality, in all its nebulosity (cloudiness, uncertainty, lack of clear boundaries), there are basically three possible answers to every question:
- Probably
- Probably not
- Maybe