I have several files exported out of a SQL Server database as tab-delimited Unicode text (numeric data & foreign names).
How do I load these files into R?
There is a special character in the first position of each file. I assume that's the BOM.
Without any options beyond header and sep, read.table puts all the data into one column, even with the proper \t separator specification.
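For reference, the call is essentially the following (the file name here is just a placeholder):

dat <- read.table("export.txt", header = TRUE, sep = "\t")
# everything ends up in a single column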
Opening a connection and skipping the first three bytes leads to an "empty beginning of file" error.
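That attempt was roughly this (and it is what produces the error just mentioned):

con <- file("export.txt", open = "rb")
readBin(con, what = "raw", n = 3)   # skip what I took to be the BOM
dat <- read.table(con, header = TRUE, sep = "\t")
close(con)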
I've tried all available fileEncoding values: I get either "line 1 didn't have 29 elements" or "more columns than column names".
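For example (UTF-16LE shown here, but I've cycled through the other encodings as well):

dat <- read.table("export.txt", header = TRUE, sep = "\t",
                  fileEncoding = "UTF-16LE")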
I'm really stumped on this one. Ideally I'd like to have everything handled within R, but any awk/PowerShell advice to fix these files is welcome. Is it just a BOM issue? One thing I can't do is control the export from SQL Server (I don't own the SSIS package). Thanks.
EDIT
For those interested in getting to the data columns in a CSV exported by SQL Server with the Unicode option on: SQL Server starts its CSV exports with a NUL character. This link discusses how to actually 'see' those special characters:
Python CSV error: line contains NULL byte
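From within R you can also dump the raw bytes to 'see' what is there; if the export really is UTF-16LE, the file starts with the BOM bytes ff fe and every ASCII character is followed by a 00 byte (file name is a placeholder):

readBin("export.txt", what = "raw", n = 16)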
EDIT
This is what I've put together to get ASCII and move along. It's not the answer to my question though.
# Python 2: decode the UTF-16 exports, strip accents, and drop anything non-ASCII
from os import listdir
from os.path import isfile, join
import unicodedata
import codecs

mypath = "C:\\Users\\Collapsed\\data"
filesOnly = [f for f in listdir(mypath) if isfile(join(mypath, f))]

for f in filesOnly:
    print f
    fln = join(mypath, f)
    fl_in = codecs.open(fln, 'Ur', 'utf-16')        # the 'utf-16' codec consumes the BOM
    fl_out = open(join(mypath, 'ascii_' + f), 'w')
    for line in fl_in:
        # decompose accented characters, then keep only the ASCII part
        fl_out.write(unicodedata.normalize('NFKD', line).encode('ascii', 'ignore'))
    fl_in.close()
    fl_out.close()
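The resulting ascii_ copies then read in with a plain read.table(file, header = TRUE, sep = "\t") call.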