
I have several files exported out of a SQL Server database as tab-delimited Unicode text (numeric data and foreign names).

How do I load these files into R?

There is a special character in the first position of each file. I assume that's the BOM.

With no options beyond `header` and `sep`, `read.table` puts all the data into one column, even with the proper `\t` separator specification.
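Roughly what I'm calling (the file name below is just a stand-in for one of the exports):

    # The plain call described above; everything lands in a single column.
    dat <- read.table("export1.txt", header = TRUE, sep = "\t")
    str(dat)  # one mashed-together column instead of separate fields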

Opening a connection and skipping the first three bytes leads to an `empty beginning of file` error.
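That attempt was along these lines (same stand-in file name); this is the version that dies with `empty beginning of file`:

    # Open a binary connection, throw away the first three bytes
    # (what I assumed was the BOM), then hand the rest to read.table.
    con <- file("export1.txt", open = "rb")
    invisible(readBin(con, what = "raw", n = 3))
    dat <- read.table(con, header = TRUE, sep = "\t")
    close(con)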

I've tried all available `fileEncoding` values: I get either `line 1 did not have 29 elements` or `more columns than column names`.
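For what it's worth, this is roughly how I cycled through the encodings (stand-in file name again):

    # Try every encoding iconv knows about and record what comes back;
    # each attempt ends in one of the two errors above.
    encs <- iconvlist()
    attempts <- lapply(encs, function(enc) {
        tryCatch(
            read.table("export1.txt", header = TRUE, sep = "\t",
                       fileEncoding = enc),
            error = function(e) conditionMessage(e)
        )
    })
    names(attempts) <- encs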

I'm really stumped on this one. Ideally I'd like to have everything handled within R, but any awk/PowerShell advice to fix these files is welcome. Is it just a BOM issue? One thing I can't do is control the export from SQL Server (I don't own the SSIS package). Thanks.

EDIT

For those interested in getting to the data columns in a CSV exported by SQL Server with the Unicode option on: SQL Server starts its CSV exports with a NUL character. This link discusses how to actually 'see' those special characters:

Python CSV error: line contains NULL byte
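The same inspection can be done from R on the raw bytes (stand-in file name):

    # Dump the first bytes of an export so the BOM / NUL characters
    # become visible: a UTF-16LE file starts with ff fe, and ASCII
    # characters are padded with 00 bytes.
    readBin("export1.txt", what = "raw", n = 16)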

EDIT

This is what I've put together to get ASCII and move along. It's not the answer to my question though.

    from os import listdir
    from os.path import isfile, join
    import unicodedata
    import codecs

    mypath = "C:\\Users\\Collapsed\\data"

    # Every plain file in the export directory (sub-directories are skipped).
    filesOnly = [f for f in listdir(mypath) if isfile(join(mypath, f))]

    for f in filesOnly:
        print f
        fln = join(mypath, f)
        # SQL Server's unicode export is UTF-16, so read it as such.
        fl_in = codecs.open(fln, 'Ur', 'utf-16')
        fl_out = open(join(mypath, 'ascii_' + f), 'w')
        for line in fl_in:
            # Decompose accented characters, then drop anything non-ASCII.
            fl_out.write(unicodedata.normalize('NFKD', line).encode('ascii', 'ignore'))
        fl_in.close()
        fl_out.close()
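With the converted copies in place, a plain `read.table` call is enough (the part after the `ascii_` prefix is a stand-in name):

    # The ascii_ files are plain tab-delimited text, so no encoding
    # arguments are needed any more.
    dat <- read.table("C:/Users/Collapsed/data/ascii_export1.txt",
                      header = TRUE, sep = "\t",
                      quote = "", comment.char = "")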
  • You can skip the first line when reading, e.g. `read.table(file, skip=1)`; if that does not work, post an example of your file – iTech Mar 02 '13 at 19:43
  • I get `empty beginning of file` without `fileEncoding`, and `more columns than column names` using `fileEncoding`. I tried with `header=T` and `header=F`. The name fields are persons' names: can't post. – user2105469 Mar 02 '13 at 19:51
  • I tried [link](http://stackoverflow.com/questions/1068650/using-awk-to-remove-the-byte-order-mark) but `read.table` still collapses all columns into one. If it can help, Excel reads the data without any problem (with and without the `awk` BOM cleanup above). – user2105469 Mar 02 '13 at 20:12
  • And I also have `comment.char=""`,`quote=""`. – user2105469 Mar 02 '13 at 20:55
  • Can't insert code here. – user2105469 Mar 02 '13 at 22:52
