
I have mixed data taken mainly from Wikidata. It contains Arabic fields and English fields, and the Arabic and English text is mixed with content in other scripts such as Greek characters, Hindi numerals, possibly Russian, emojis, and Cyrillic letters:

Example of possible data:

ڱ
١٩٣٣
Белгородский государственный университет

The sample above shows up as ? in my database. I tried different collation choices (a quick round-trip check is sketched after this list), such as

ARABIC_CI_AS
Arabic_CI_AI_KS
Arabic_100_CI_AI_KS_WS
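
Since the columns are all nvarchar, the collation should only affect sorting and comparison, not which characters can be stored, so a quick round-trip test can rule it out. The snippet below is only a minimal sketch: it reuses the srvr/db connection details from the import script further down, and the temp table name #collation_test is made up just for the test.

import pyodbc

# The three sample values from the question.
samples = ["ڱ", "١٩٣٣", "Белгородский государственный университет"]

cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=srvr;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()

# Temp table using one of the collations that was tried; the column is still nvarchar.
cursor.execute("CREATE TABLE #collation_test "
               "(val nvarchar(200) COLLATE Arabic_100_CI_AI_KS_WS)")
for s in samples:
    cursor.execute("INSERT INTO #collation_test (val) VALUES (?)", s)

# If the values come back intact, the collation is not what turns them into ?.
for (val,) in cursor.execute("SELECT val FROM #collation_test"):
    print(val)

cursor.close()
cnxn.close()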

My table schema:

CREATE TABLE [dbo].[NED_Corpus_True_Expanded]
(
    [id] [int] IDENTITY(1,1) NOT NULL,
    [Wikidata_IQ] [nvarchar](50) NULL,
    [WikidataType] [nvarchar](500) NULL,
    [WikidataArabicLabel] [nvarchar](max) NULL,
    [WikidataEnglishLabel] [nvarchar](max) NULL,
    [WikidataAraliases] [nvarchar](max) NULL,
    [WikidataEnaliases] [nvarchar](max) NULL,
    [WikidataArabicDescrption] [nvarchar](max) NULL,
    [WikidataEnglishDescrption] [nvarchar](max) NULL,
    [WikidataMainCategory] [nvarchar](max) NULL,
    [WikipediaArabicLabel] [nvarchar](max) NULL,
    [WikipediaArabicDescrption] [nvarchar](max) NULL,
    [Label] [float] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

But none of them works for the mixed data; here is a screenshot of what the data looks like:

Data from SQL Server table

I inserted the data from CSV into SQL table using the following script:

import pyodbc
import pandas as pd
# Read the CSV file into a dataframe.
# Working directory for the CSV file: type "pwd" in Azure Data Studio or on Linux;
# in Windows it is c:\users\username by default.
df = pd.read_csv("ned_true_new.csv")
df.fillna("", inplace=True)
df.info()
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
 
#password = 'yourpassword' 
#cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username)
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=srvr;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    # print(row)
    cursor.execute('''INSERT INTO [dbo].[NED_Corpus_True_Expanded]
          ([Wikidata_IQ]
          ,[WikidataType]
          ,[WikidataArabicLabel]
          ,[WikidataEnglishLabel]
          ,[WikidataAraliases]
          ,[WikidataEnaliases]
          ,[WikidataArabicDescrption]
          ,[WikidataEnglishDescrption]
          ,[WikidataMainCategory]
          ,[WikipediaArabicLabel]
          ,[WikipediaArabicDescrption]
          ,[Label])
    VALUES (?,?,?,?,?,?,?,?,?,?,?,?)''',
                   row.id, row.type, row.arlabel, row.enlabel, row.araliases, row.enaliases,
                   row.ardescription, row.endescription, row.maincategory, row.arwiki,
                   row.wikiDescrption, row.Label)
cnxn.commit()
cursor.close()
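
A variant of the same import that may be worth trying is a batched insert through a newer ODBC driver. This is only a sketch, not a confirmed fix: it assumes "ODBC Driver 17 for SQL Server" is installed, reads the CSV explicitly as UTF-8, and reuses the same column names as above.

import pyodbc
import pandas as pd

# Read the CSV as UTF-8 explicitly so nothing is lost before it reaches SQL Server.
df = pd.read_csv("ned_true_new.csv", encoding="utf-8")
df.fillna("", inplace=True)

# Assumes the newer "ODBC Driver 17 for SQL Server" is installed on the machine.
cnxn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=srvr;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()

sql = '''INSERT INTO [dbo].[NED_Corpus_True_Expanded]
         ([Wikidata_IQ], [WikidataType], [WikidataArabicLabel], [WikidataEnglishLabel],
          [WikidataAraliases], [WikidataEnaliases], [WikidataArabicDescrption],
          [WikidataEnglishDescrption], [WikidataMainCategory], [WikipediaArabicLabel],
          [WikipediaArabicDescrption], [Label])
         VALUES (?,?,?,?,?,?,?,?,?,?,?,?)'''

# Build plain tuples in the same column order as the INSERT statement.
rows = list(df[["id", "type", "arlabel", "enlabel", "araliases", "enaliases",
                "ardescription", "endescription", "maincategory", "arwiki",
                "wikiDescrption", "Label"]].itertuples(index=False, name=None))

# executemany sends the whole batch; cursor.fast_executemany = True can speed it up
# further, but is left off here to keep the sketch simple.
cursor.executemany(sql, rows)
cnxn.commit()
cursor.close()
cnxn.close()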
  • See the following link. I know that you're using nvarchar but it deals with the same need. https://stackoverflow.com/questions/48368161/what-sql-collation-should-i-use-to-store-multi-language-character –  Mar 27 '22 at 08:14
  • How do you insert the data in your database? – Wouter Mar 27 '22 at 08:23
  • I referred to the post, but I am already using nvarchar, and the problem is still there. – Mai Mar 27 '22 at 08:24
  • I inserted the data using the Python script shown above. – Mai Mar 27 '22 at 08:25
  • Are you using `nvarchar` source to populate the table? I see no problem with `select N'ڱ' t1, N'١٩٣٣' t2, N'Белгородский государственный университет' t3` for example. – Serg Mar 27 '22 at 08:32
  • so what is the problem? – Mai Mar 27 '22 at 08:34

1 Answer


I want to post the solution so the community can make use of it: I created a new database with the default collation and used the same schema (nvarchar columns) and the same import code, and the default collation turned out to be suitable for data in all of these languages.
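
As a sanity check after switching databases, the snippet below prints the database collation and reads a few rows back to confirm the mixed-language text survived the round trip. It is just a sketch reusing the srvr/db connection details from the question; the TOP (5) query is only illustrative.

import pyodbc

cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=srvr;"
                      "Database=db;"            # name of the newly created database
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()

# Print the collation the new database actually ended up with.
print(cursor.execute(
    "SELECT CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Collation'))"
).fetchval())

# Pull a few rows back; the Arabic and English labels should display intact.
for row in cursor.execute(
    "SELECT TOP (5) WikidataArabicLabel, WikidataEnglishLabel "
    "FROM dbo.NED_Corpus_True_Expanded"
):
    print(row.WikidataArabicLabel, "|", row.WikidataEnglishLabel)

cursor.close()
cnxn.close()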
