I needed to do this recursively, and here's what I came up with:
find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done
This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it -- but I was in a hurry :P
What the pieces do:
find -type f
gives a recursive list of filenames, with paths relative to the current directory
while read l; do ... done
Bash loop; for each line of the list of file paths, put the path into $l
and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. I couldn't think of a way to do that while feeding multiple files at once to iconv, and since I'm doing one file at a time anyway, a shell loop has simpler syntax and escaping.)
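A note on robustness: while read l trips over filenames with leading whitespace or backslashes, and breaks outright on names containing newlines. If that matters, a sturdier skeleton (just a sketch, assuming GNU or BSD find and bash) is:

find . -type f -print0 | while IFS= read -r -d '' f; do
  # -print0 and read -d '' pass names NUL-delimited, so any filename survives
  iconv -s -f utf-16le -t utf-8 "$f" | nl -s "$f: " | cut -c7- | grep 'somestring'
done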
iconv -s -f utf-16le -t utf-8 "$l"
Convert the file named in $l: assume the input file is UTF-16 little-endian and convert it to UTF-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not UTF-16). The output from this conversion goes to stdout.
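If you want to see what this conversion step does on its own, run it against a single file (the filename here is just a placeholder):

iconv -s -f utf-16le -t utf-8 some-utf16-file.txt | head

If the file really is UTF-16LE you get readable text on stdout; if it isn't, you mostly get garbage, and -s keeps iconv from complaining about it.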
nl -s "$l: " | cut -c7-
This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by a colon and a space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. With a sed expression I'd have to worry about regular-expression metacharacters in the filenames, of which there were a lot in my case. nl is much dumber than sed, and just takes the -s parameter entirely literally, and the shell handles the escaping for me.)
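You can watch the prefixing trick work on a throwaway input (the filename here is made up, purely for illustration):

printf 'hello\nworld\n' | nl -s 'file.txt: ' | cut -c7-

which prints:

file.txt: hello
file.txt: world

nl pads the line number to six characters by default, which is why cut -c7- removes exactly the number and leaves the separator string behind.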
So, by the end of this pipeline, I've converted a bunch of files into lines of UTF-8, each prefixed with the filename, which I then grep. If there are matches, I can tell which file they're in from the prefix.
Caveats
- This is much, much slower than grep -R, because I'm spawning a new copy of iconv, nl, cut, and grep for every single file. It's horrible. (A slightly leaner variant is sketched below, after these caveats.)
- Everything that isn't utf-16le input will come out as complete garbage, so if there's a normal ASCII file that contains 'somestring', this command won't report it -- you need to do a normal grep -R as well as this command (and if you have multiple Unicode encodings, like some big-endian and some little-endian files, you need to adjust this command and run it again for each encoding).
- Files whose name happens to contain 'somestring' will show up in the output, even if their contents have no matches.
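For what it's worth, a slightly leaner variant (a sketch only; it still spawns processes per file and still only handles utf-16le) is to let grep label stdin with the filename, which drops the nl/cut hack entirely:

find -type f | while read l; do
  # --label names standard input in the output; -H forces the name to be printed
  iconv -s -f utf-16le -t utf-8 "$l" | grep -H --label="$l" 'somestring'
done

Both --label and -H are GNU grep options, so check your grep if you're on something else. As a side benefit, the filename is no longer part of the text being searched, so the last caveat above goes away.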