
I have a very large text file, where most of the lines are composed of ASCII characters, but a small fraction of lines have non-ASCII characters. What is the fastest way to create a new text file containing only the ASCII lines? Right now I am checking each character in each line to see if it's ASCII, and writing each line to the new file if all the characters are ASCII, but this method is rather slow. Also, I am using Python, but would be open to using other languages in the future.

Edit: updated with code

#!/usr/bin/python

def isAscii(s):
    # A character is ASCII iff its code point fits in 7 bits.
    for c in s:
        if ord(c) > 127:
            return False
    return True

f = open('data.tsv')
g = open('data-ASCII-only.tsv', 'w')

for line in f:
    if isAscii(line):
        g.write(line)

f.close()
g.close()
Jessica
  • possible duplicate of [How to check if a string in Python is in ASCII?](http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii) – Ruben Bermudez Jun 24 '14 at 18:25
  • 3
    The way you're doing it is pretty much what you have to do. All lines must be read; all characters must be checked; and all wanted lines must be written. There aren't really any algorithmic shortcuts here. Show us your code and we can look for inefficiencies. – Tom Zych Jun 24 '14 at 18:27
  • @TomZych: Thanks, I updated the question with code. – Jessica Jun 24 '14 at 18:29
  • @TomZych Trying to encode the whole line to 'ascii' inside a try/except might be faster than checking each character individually; see the sketch after these comments. See the question RubenBermudez linked to. – dano Jun 24 '14 at 18:29
  • @dano - Yes, I saw that. Of course, that's still checking all characters. You're just doing it in C. – Tom Zych Jun 24 '14 at 22:36
  • It occurs to me that we cannot answer this question without knowing how the file is encoded. The simple approach of checking for characters < 0x80 will work with UTF-8; it won't work with UTF-7 or UTF-16. You can't even read the line correctly in UTF-16 if you treat it as ASCII; the newline is 0x0A 0x00. – Tom Zych Jun 24 '14 at 23:12
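
A minimal sketch of the approach dano describes (assuming Python 2, where lines read from a file are byte strings; file names are taken from the question and the helper name is illustrative): hand each whole line to the ascii codec, which scans it in C and raises on the first offending byte, instead of looping over characters in Python.

def is_ascii_fast(line):
    # The ascii codec rejects any byte above 127; catching the error
    # replaces the per-character Python loop.
    try:
        line.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True

f = open('data.tsv')
g = open('data-ASCII-only.tsv', 'w')

for line in f:
    if is_ascii_fast(line):
        g.write(line)

f.close()
g.close()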

3 Answers


You can use grep: "-v" inverts the match (keeping the lines that do not match), "-P" enables Perl-compatible regex syntax, and [\x80-\xFF] matches any byte outside the 7-bit ASCII range.

grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv

See the question How do I grep for all non-ASCII characters in UNIX for more about searching for non-ASCII characters with grep. A byte-level Python version of the same filter is sketched below.
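
For comparison, a Python sketch of the same filter working on raw bytes. Like the grep command, it treats any byte with the high bit set as non-ASCII, which is only correct for single-byte encodings and UTF-8; the file names are taken from the question.

import re

high_bit = re.compile(br'[\x80-\xFF]')  # any byte outside 7-bit ASCII

with open('data.tsv', 'rb') as f, open('data-ASCII-only.tsv', 'wb') as g:
    for line in f:
        if not high_bit.search(line):
            g.write(line)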

Community

The following suggestion uses a command-line filter (i.e., you would use it on the shell command line). This example works in a shell on Linux or Unix systems, and probably OS X too (I've heard OS X is BSDish):

$ tr -dc '\000-\177' < big_file > big_file_ascii_only

It uses the "tr" (translate) filter. In this case, we are telling tr to "delete" all characters which are outside the range octal 000 to octal 177, i.e. 7-bit ASCII. Note that this strips the offending characters but keeps every line, whereas the question asks to drop those lines entirely. You may wish to tweak the character set - check the man page for tr to get some ideas on other ways to specify the characters you want to keep (or delete). A rough Python equivalent is sketched below.
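
A rough Python 3 equivalent of the tr filter (a sketch with the same character-stripping behaviour, not line filtering; file names are taken from the command above):

high_bytes = bytes(range(0x80, 0x100))  # the bytes to delete

with open('big_file', 'rb') as f, open('big_file_ascii_only', 'wb') as g:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        # translate with a None table applies only the deletions.
        g.write(chunk.translate(None, high_bytes))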

Brenda J. Butler

The other approaches given will work if, and only if, the file is encoded in such a way that "non-ASCII" is equivalent to "high bit set", such as Latin-1 or UTF-8. Here's a program in Python 3 that will work with any encoding.

#!/usr/bin/env python3

import codecs

in_fname = "utf16file"
in_encoding = "utf-16"
out_fname = "ascii_lines"
out_encoding = "ascii"

def is_ascii(s):
    # Encoding to ASCII fails iff the line contains a non-ASCII
    # character; the scan happens in C rather than a Python loop.
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

# codecs.open decodes and encodes transparently, so is_ascii sees
# proper Unicode strings no matter how the input file is encoded.
f_in = codecs.open(in_fname, "r", in_encoding)
f_out = codecs.open(out_fname, "w", out_encoding)

for s in f_in:
    if is_ascii(s):
        f_out.write(s)

f_in.close()
f_out.close()
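
For what it's worth, on Python 3 the built-in open() takes an encoding argument directly, so the codecs module is not strictly needed; a sketch of the same loop using context managers:

with open(in_fname, "r", encoding=in_encoding) as f_in, \
     open(out_fname, "w", encoding=out_encoding) as f_out:
    for s in f_in:
        if is_ascii(s):
            f_out.write(s)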
Tom Zych