
I have mixed data taken mainly from Wikidata. It contains Arabic fields and English fields, and the Arabic and English text is mixed with content in other scripts such as Greek characters, Hindi numerals, possibly Russian, emojis, and Cyrillic letters:

Example of possible data:

ڱ
١٩٣٣
Белгородский государственный университет

The sample above shows up as ? in my database. I tried different collation choices (a quick round-trip check is sketched after this list), such as

ARABIC_CI_AS
Arabic_CI_AI_KS
Arabic_100_CI_AI_KS_WS
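
Since the columns are all nvarchar, the collation should only affect sorting and comparison, not which characters can be stored, so a quick round-trip test can rule it out. The snippet below is only a minimal sketch: it reuses the srvr/db connection details from the import script further down, and the temp table name #collation_test is made up just for the test.

import pyodbc

# The three sample values from the question.
samples = ["ڱ", "١٩٣٣", "Белгородский государственный университет"]

cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=srvr;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()

# Temp table using one of the collations that was tried; the column is still nvarchar.
cursor.execute("CREATE TABLE #collation_test "
               "(val nvarchar(200) COLLATE Arabic_100_CI_AI_KS_WS)")
for s in samples:
    cursor.execute("INSERT INTO #collation_test (val) VALUES (?)", s)

# If the values come back intact, the collation is not what turns them into ?.
for (val,) in cursor.execute("SELECT val FROM #collation_test"):
    print(val)

cursor.close()
cnxn.close()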

My table schema:

CREATE TABLE [dbo].[NED_Corpus_True_Expanded]
(
    [id] [int] IDENTITY(1,1) NOT NULL,
    [Wikidata_IQ] [nvarchar](50) NULL,
    [WikidataType] [nvarchar](500) NULL,
    [WikidataArabicLabel] [nvarchar](max) NULL,
    [WikidataEnglishLabel] [nvarchar](max) NULL,
    [WikidataAraliases] [nvarchar](max) NULL,
    [WikidataEnaliases] [nvarchar](max) NULL,
    [WikidataArabicDescrption] [nvarchar](max) NULL,
    [WikidataEnglishDescrption] [nvarchar](max) NULL,
    [WikidataMainCategory] [nvarchar](max) NULL,
    [WikipediaArabicLabel] [nvarchar](max) NULL,
    [WikipediaArabicDescrption] [nvarchar](max) NULL,
    [Label] [float] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

But none of them works for the mixed data; here is a screenshot of what the data looks like:

Data from SQL Server table

I inserted the data from CSV into SQL table using the following script:

import pyodbc
import pandas as pd
# Read the CSV file into a dataframe.
# Working directory for the CSV file: type "pwd" in Azure Data Studio or on Linux;
# in Windows it is c:\users\username by default.
df = pd.read_csv("ned_true_new.csv")
df.fillna("", inplace=True)
df.info()
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
 
#password = 'yourpassword' 
#cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username)
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=srvr;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    # print(row)
    cursor.execute('''INSERT INTO [dbo].[NED_Corpus_True_Expanded]
          ([Wikidata_IQ]
          ,[WikidataType]
          ,[WikidataArabicLabel]
          ,[WikidataEnglishLabel]
          ,[WikidataAraliases]
          ,[WikidataEnaliases]
          ,[WikidataArabicDescrption]
          ,[WikidataEnglishDescrption]
          ,[WikidataMainCategory]
          ,[WikipediaArabicLabel]
          ,[WikipediaArabicDescrption]
          ,[Label])
    VALUES (?,?,?,?,?,?,?,?,?,?,?,?)''',
                   row.id, row.type, row.arlabel, row.enlabel, row.araliases, row.enaliases,
                   row.ardescription, row.endescription, row.maincategory, row.arwiki,
                   row.wikiDescrption, row.Label)
cnxn.commit()
cursor.close()
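
A variant of the same import that may be worth trying is a batched insert through a newer ODBC driver. This is only a sketch, not a confirmed fix: it assumes "ODBC Driver 17 for SQL Server" is installed, reads the CSV explicitly as UTF-8, and reuses the same column names as above.

import pyodbc
import pandas as pd

# Read the CSV as UTF-8 explicitly so nothing is lost before it reaches SQL Server.
df = pd.read_csv("ned_true_new.csv", encoding="utf-8")
df.fillna("", inplace=True)

# Assumes the newer "ODBC Driver 17 for SQL Server" is installed on the machine.
cnxn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=srvr;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()

sql = '''INSERT INTO [dbo].[NED_Corpus_True_Expanded]
         ([Wikidata_IQ], [WikidataType], [WikidataArabicLabel], [WikidataEnglishLabel],
          [WikidataAraliases], [WikidataEnaliases], [WikidataArabicDescrption],
          [WikidataEnglishDescrption], [WikidataMainCategory], [WikipediaArabicLabel],
          [WikipediaArabicDescrption], [Label])
         VALUES (?,?,?,?,?,?,?,?,?,?,?,?)'''

# Build plain tuples in the same column order as the INSERT statement.
rows = list(df[["id", "type", "arlabel", "enlabel", "araliases", "enaliases",
                "ardescription", "endescription", "maincategory", "arwiki",
                "wikiDescrption", "Label"]].itertuples(index=False, name=None))

# executemany sends the whole batch; cursor.fast_executemany = True can speed it up
# further, but is left off here to keep the sketch simple.
cursor.executemany(sql, rows)
cnxn.commit()
cursor.close()
cnxn.close()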
  • See the following link. I know that you're using nvarchar but it deals with the same need. https://stackoverflow.com/questions/48368161/what-sql-collation-should-i-use-to-store-multi-language-character –  Mar 27 '22 at 08:14
  • How do you insert the data in your database? – Wouter Mar 27 '22 at 08:23
  • I referred to the post, but I am already using nvarchar, and the problem is still there. – Mai Mar 27 '22 at 08:24
  • I inserted the data using the Python script shown above. – Mai Mar 27 '22 at 08:25
  • Are you using `nvarchar` source to populate the table? I see no problem with `select N'ڱ' t1, N'١٩٣٣' t2, N'Белгородский государственный университет' t3` for example. – Serg Mar 27 '22 at 08:32
  • so what is the problem? – Mai Mar 27 '22 at 08:34

1 Answer


I want to post the solution so the community can make use of it: I created a new database with the default collation and used the same schema (nvarchar columns) and the same import code, and the default collation turned out to be suitable for data in all of these languages.
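
As a sanity check after switching databases, the snippet below prints the database collation and reads a few rows back to confirm the mixed-language text survived the round trip. It is just a sketch reusing the srvr/db connection details from the question; the TOP (5) query is only illustrative.

import pyodbc

cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=srvr;"
                      "Database=db;"            # name of the newly created database
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()

# Print the collation the new database actually ended up with.
print(cursor.execute(
    "SELECT CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Collation'))"
).fetchval())

# Pull a few rows back; the Arabic and English labels should display intact.
for row in cursor.execute(
    "SELECT TOP (5) WikidataArabicLabel, WikidataEnglishLabel "
    "FROM dbo.NED_Corpus_True_Expanded"
):
    print(row.WikidataArabicLabel, "|", row.WikidataEnglishLabel)

cursor.close()
cnxn.close()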
