
I am trying to check whether a CSV file contains special characters. The file consists of one column with about 180,000 rows. Since the file contains Korean, English, and Chinese, I added `가-힣`, `A-Z`, and `0-9`, but I do not know what I should add so that Chinese characters are not filtered. Or is there a better way to do this?

The special characters I am looking for are: ■, △, ?, etc.

The special characters I do not want to count are: units (e.g. ㎍, ㎥, ℃), (), ', etc.

Searching on Stack Overflow, I found that most questions start by listing the special characters to look for. In my case that is difficult, since I have 180,000 records and do not know which characters are actually in there. As far as I know, there are only three languages: Korean, English, and Chinese.

This is my code so far:

with open("C:/count1.csv", 'w', encoding='cp949', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    with open(file, 'r') as fi:
        for line in fi:
            x = not('가-힣', 'A-Z', '0-9')
            if x in line:
                sub = re.sub(x, '*', line.rstrip())
            count = len(sub)
            lst = [fi] + [count]
            csv_writer.writerow(lst)

Using `import re`:

regex=not'[가-힣]','[a-z]','[0-9]'

file="C:/kd/fields.csv"
with open("C:/specialcharacter.csv", 'w', encoding='cp949', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    with open(file, 'r') as fi:
        for line in fi:
            search_target = line
            result = re.findall(regex, search_target)
            print("\n".join(result))

Do Hun Kim
  • I suggest you look at their ASCII or other character codes, as in this post: https://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python. Then you will most likely have to go through the characters manually and decide which ones you do or do not want. – Landmaster Aug 08 '17 at 04:40
  • @Landmaster Hmm, if I'm not mistaken, your link shows converting letters to int values and adding them up, right? I just need to count occurrences of special characters. – Do Hun Kim Aug 08 '17 at 05:01
  • Oh yes yes, the post shows you how to add them, but I'm saying you can convert the characters to unicode or ascii as those are exhaustive, I believe. – Landmaster Aug 08 '17 at 05:08
  • Conversely, you could put all characters in a dictionary and include the action in the dictionary, say {'a': True, '#': False} and then use the True or False to indicate whether you should filter or not. – Landmaster Aug 08 '17 at 05:09
  • @Landmaster Ok. I get it. I will try that method too – Do Hun Kim Aug 08 '17 at 05:15

1 Answer


I am not sure why you would avoid filtering Chinese characters when you are only looking for some special characters. This library (Zhon) can filter Chinese.

  1. Filter Chinese on top of your existing filter for Korean, English, and digits: `regex = "[^가-힣a-zA-Z0-9]"`, then `result = re.findall(regex, search_target)`.
  2. Then filter using either 1) a list of the special characters you are looking for or 2) a list of the special characters you want to ignore.

Choose whichever fits your case better, so that you run into as few exceptions as possible and do not have to add more filters every time.

Write that list as a regex.

Then loop through your 180,000 rows, using the regex to filter them.

Keep updating the regex list until everything is filtered.
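
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the range `\u4e00-\u9fff` (CJK Unified Ideographs) stands in for "Chinese", the ignore list holds just the symbols named in the question, whitespace is treated as allowed, and `count_special` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: "allowed" text is Hangul syllables, ASCII letters, digits,
# CJK Unified Ideographs (U+4E00-U+9FFF), and whitespace.
ALLOWED = "가-힣a-zA-Z0-9\u4e00-\u9fff\\s"

# Assumption: symbols that should not be counted (taken from the question).
IGNORED = "㎍㎥℃()'"

# Negated character class: matches any single character that is neither
# allowed nor explicitly ignored.
SPECIAL = re.compile("[^" + ALLOWED + re.escape(IGNORED) + "]")


def count_special(lines):
    """Count every remaining special character across an iterable of strings."""
    counts = Counter()
    for line in lines:
        counts.update(SPECIAL.findall(line))
    return counts


# Made-up sample rows, only for demonstration.
sample = ["온도 25℃ ■ test", "中文 (ok) △△ what?"]
counts = count_special(sample)
print(counts.most_common())
```

To run this on the real file, pass the open file object in place of `sample`, and extend `IGNORED` whenever the counts reveal another symbol you do not care about.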

Jisu Hong
  • Thank you. As I mentioned above, your first suggestion is time consuming. I think the second suggestion is better for me, which is to count every character except numbers, Korean, English, and Chinese. I will try it. – Do Hun Kim Aug 08 '17 at 04:58
  • Either way, first filter out numbers, Korean, English, and Chinese so that only the special characters are left. – Jisu Hong Aug 08 '17 at 05:00
  • Then you can choose to make a list of what you are looking for: ■ , △, ?, etc. or a list of what you do not want to count: Unit (ex : ㎍, ㎥, ℃), (), ' etc. – Jisu Hong Aug 08 '17 at 05:02
  • To code, I just need to add `Zhon` right? Since it is a separate library, I can't write in one line. – Do Hun Kim Aug 08 '17 at 05:08
  • You need to install Zhon: http://zhon.readthedocs.io/en/latest/#installation – Jisu Hong Aug 08 '17 at 05:10
  • Then you can do `re.findall('[%s]' % zhon.hanzi.characters, 'unfiltered_string')` to filter. – Jisu Hong Aug 08 '17 at 05:11
  • Okay. I will try it after I am done with refining my dataset. Thanks again – Do Hun Kim Aug 08 '17 at 05:14
  • How can I write `x=not('가-힣','A-Z','1-9')` as a string? When I try my code above, I get `TypeError: 'in <string>' requires string as left operand, not bool` – Do Hun Kim Aug 09 '17 at 05:27
  • Why do you need it as a string? It's a regex. – Jisu Hong Aug 09 '17 at 05:29
  • I get a `TypeError` when I try my code. I have not added the Zhon part yet. – Do Hun Kim Aug 09 '17 at 05:31
  • Do you know what regex is? – Jisu Hong Aug 09 '17 at 05:44
  • I edited my question to show what I tried using regex. I googled it to find what it is but still don't know how I can filter special characters. – Do Hun Kim Aug 09 '17 at 05:58
  • You are almost correct: `regex = "[^가-힣a-zA-Z0-9]"`, then `result = re.findall(regex, search_target)`. – Jisu Hong Aug 09 '17 at 06:23
  • Oh nice, I got it. Just one more question: unless I add ' ' or '\n' to the regex, it looks like it picks them up too along with the special characters, right? – Do Hun Kim Aug 09 '17 at 06:58
  • The \n shows up because you are reading the file line by line, so each line still ends with it; that is not caused by the regex itself. A space, however, will keep being matched unless you add it to the character class. – Jisu Hong Aug 09 '17 at 07:50
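
The whitespace behaviour discussed in the last comments can be checked with a short sketch. The sample string is made up; it imitates a line as read from a file, with the trailing newline still attached.

```python
import re

# A line as it would come from iterating over a file: the newline is kept.
line = "가 a\n"

# Without whitespace in the negated class, the space and the newline both match.
without_ws = re.findall("[^가-힣a-zA-Z0-9]", line)

# Adding \s to the class makes the pattern skip all whitespace.
with_ws = re.findall(r"[^가-힣a-zA-Z0-9\s]", line)

print(without_ws)
print(with_ws)
```

Calling `line.rstrip()` before matching is another way to drop the trailing newline, but it leaves spaces inside the line untouched.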