I am trying to find special characters in a CSV file. The file consists of one column with about 180,000 rows. Since my file contains Korean, English, and Chinese, I added `가-힣`, `A-Z`, and `0-9`, but I do not know what I should add so that Chinese characters are not filtered out as special. Or is there a better way to do this?
Special characters I am looking for: ■, △, ?, etc.

Special characters I do not want to count: units (e.g. ㎍, ㎥, ℃), parentheses `()`, apostrophes `'`, etc.
Searching on Stack Overflow, most questions start by specifying the special characters to look for. In my case that is difficult, because I have 180,000 records and I do not know which characters are actually in there. As far as I know, there are only three languages: Korean, English, and Chinese.
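From what I have read, I imagine a negated character class is what I need. Here is a minimal sketch of the idea; the CJK range `\u4e00-\u9fff` for Chinese and the explicit whitelist of units and punctuation are only my guesses, so I am not sure they actually cover my data:

    import re

    # Allowed characters: Hangul syllables, Latin letters, digits,
    # a guessed CJK range for Chinese, and a guessed whitelist of units/punctuation.
    allowed = r"가-힣A-Za-z0-9\u4e00-\u9fff㎍㎥℃()' "
    special = re.compile(f"[^{allowed}]")

    print(special.findall("서울 Beijing 北京 25℃ (■△)"))   # expected: ['■', '△']

But since I do not know every character that appears in 180,000 rows, I am not confident this whitelist is complete.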
This is my code so far:
with open("C:/count1.csv",'w',encoding='cp949',newline='') as testfile:
csv_writer=csv.writer(testfile)
with open(file,'r') as fi:
for line in fi:
x=not('가-힣','A-Z','0-9')
if x in line :
sub=re.sub(x,'*',line.rstrip())
count=len(sub)
lst=[fi]+[count]
csv_writer.writerow(lst)
I also tried using `re.findall` to print out every character that is not Korean, English, or a digit:

    import csv
    import re

    regex = r"[^가-힣A-Za-z0-9]"
    file = "C:/kd/fields.csv"

    with open("C:/specialcharacter.csv", 'w', encoding='cp949', newline='') as testfile:
        csv_writer = csv.writer(testfile)
        with open(file, 'r') as fi:
            for line in fi:
                result = re.findall(regex, line)
                print("\n".join(result))