
I am trying to remove non-ASCII characters from a large text file. I googled around and found the following tr command, which works perfectly. However, I wish to use awk, because this command will go into my existing awk script. Would appreciate any help!

tr -cd '\11\12\15\40-\176' < InputFile > OutputFile
rogerwhite
  • That doesn't just remove non-ASCII characters, it removes some ASCII characters too. In particular it'll remove ASCII chars \00-\10, \13, \14, \16-\37, and \177. I feel like your goal isn't really to delete non-ASCII chars but something else, and there may be a POSIX character class (or combination of such) already existing for it. Maybe you want to delete all chars in the `[:cntrl:]` character class? If you tell us what you're really trying to do and provide a [mcve] with concise, testable sample input and expected output then we can help. – Ed Morton May 21 '20 at 15:14
  • Thanks Ed. The mission is to clean some text files and upload them to the AWS cloud, then use an S3 Select query to search the data. Now, the trouble with S3 Select is that as soon as it sees a non-UTF-8 character in the file, it throws an error... my text files have loads of junk. Anyway, if there is a non-UTF-8 char, then most likely I don't need it. Hence, I can get rid of it before uploading to AWS – rogerwhite May 22 '20 at 05:25
  • UTF-8 is a variable-width encoding. ASCII characters use 7 bits (hence there are 128 of them) and are stored in UTF-8 as single bytes, leaving the remaining byte values for multi-byte sequences (e.g. for encoding characters with accents). Your script is deleting all but a subset of ASCII characters. You can't reliably test a given file to see if it's UTF-8 encoded or UTF-16 or something else (`file` will guess) and you can't tell if a given byte sequence in a file is a UTF-8-encoded x or a some-other-encoded y (where x and y are some characters). So, idk how you could do what you say you want. – Ed Morton May 22 '20 at 13:03
  • Take a look at https://unix.stackexchange.com/q/11602/133219 and https://stackoverflow.com/q/19212306/1745001 for more info on Unicode encodings and how ASCII fits into them. – Ed Morton May 22 '20 at 13:08

2 Answers


Try gsub with the same octal escapes in a bracket expression:

gsub(/[^\11\12\15\40-\176]/,"")
  • Thanks user, but for some reason this code still isn't foolproof. When I test it, it misses some special characters. This command is part of my larger awk script; hence, even one non-ASCII character kills the rest of the script. The mentioned "tr -cd" still works well. Is there a way to change the gsub code, maybe expand it a bit, and make it comprehensive? – rogerwhite May 22 '20 at 04:56
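One likely reason the regex-constant form misses characters: octal escapes inside a `/.../` constant are not portable across awk implementations, and in a multibyte locale awk matches characters rather than bytes. A sketch that sidesteps both issues (file names are the question's placeholders) builds the bracket expression in a string, where octal escapes are standard awk, and forces the C locale so the deletion works byte-by-byte like `tr -cd` does:

```shell
# Sketch only; InputFile/OutputFile are placeholders from the question.
# Octal escapes are well-defined in awk *strings*, so the bracket expression
# is built there and passed to gsub as a dynamic regex. LC_ALL=C makes the
# match operate on raw bytes, mirroring the original tr command.
printf 'ok\tline\200with\377junk\n' > InputFile
LC_ALL=C awk '{ gsub("[^\011\012\015\040-\176]", "") } 1' InputFile > OutputFile
cat OutputFile
```

The kept set is the same as in the tr command: tab, newline, carriage return, and the printable range space through `~`; the stray `\200` and `\377` bytes are deleted.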

Per the gawk manual, to match ASCII or non-ASCII characters you can use a range:

you can simulate such a construct using [\x00-\x7F]. This matches all values numerically between zero and 127, which is the defined range of the ASCII character set. Use a complemented character list ([^\x00-\x7F]) to match any single-byte characters that are not in the ASCII range.

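As a hedged illustration of the quoted advice (file names are placeholders, not from the question): the regex constant `/[^\x00-\x7F]/` needs gawk, so the sketch below uses the equivalent octal range in a dynamic-regex string, which also runs on non-GNU awks, with `LC_ALL=C` so the match is per byte:

```shell
# Sketch, not a drop-in: strips every byte outside the 7-bit ASCII range.
# Note [^\001-\177] also matches NUL, which gets deleted too -- usually
# harmless for text files full of junk bytes.
printf 'caf\303\251 latte\n' > InputFile        # UTF-8 for "café latte"
LC_ALL=C awk '{ gsub("[^\001-\177]", "") } 1' InputFile > OutputFile
cat OutputFile
```

Here the two UTF-8 continuation bytes of `é` are removed, leaving `caf latte` — which also shows the caveat from the comments: deleting non-ASCII bytes mangles legitimate accented characters rather than transliterating them.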
Ed Morton