I am trying, without success, to read a *.csv file that contains hidden or invisible characters. The file contents are shown here:
my.data2 <- read.table(text = '
Common.name, Scientific.name, Stuff1, Stuff2
Greylag.Goose, Anser.anser, AAC, rr
Snow.Goose, Anser.caerulescens, AAC, rr
Greater.Canada.Goose, Branta.canadensis, AAC, rr
Barnacle.Goose, Branta.leucopsis, AAC, rr
Brent.Goose, Branta.bernicla, AAC, rr
', header = TRUE, sep=',', stringsAsFactors = FALSE)
Note that the above read.table command reads the data correctly. However, read.csv cannot read the file correctly because in many lines there is a hidden character following the second blank space. In some lines there is also a hidden character after the first blank space. In some lines there are no hidden characters. For example:
setwd('c:/users/mmiller21/simple R programs')
my.data <- read.csv('invisible.delimiter2.csv', header = TRUE)
my.data
returns:
Common.name Scientific.name Stuff1 Stuff2
1 Greylag.Goose Anser.anser
2 AAC rr
3 Snow.Goose
4 Anser.caerulescens
5 AAC rr
6 Greater.Canada.Goose Branta.canadensis AAC rr
7 Barnacle.Goose Branta.leucopsis
8 AAC rr
9 Brent.Goose Branta.bernicla
10 AAC rr
More specifically, if I open the *.csv file in Notepad and use the right-arrow key to move the cursor along the first line of data, I have to press the right-arrow key twice to move past the first A in AAC.
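A hidden character like this can be identified from within R by looking at byte values rather than at the printed text. The following is a minimal sketch, not a definitive diagnosis: `demo_line` is a made-up sample line, and the `\r` in it is only an assumption about what the hidden character might be.

```r
# Hypothetical one-line sample; the "\r" stands in for the suspected hidden character.
demo_line <- "Greylag.Goose, Anser.anser, \rAAC, rr"

# Convert the string to its byte values (integers 0-255).
codes <- as.integer(charToRaw(demo_line))

# ASCII control characters have codes below 32, plus 127 (DEL).
hidden_pos <- which(codes < 32L | codes == 127L)

hidden_pos          # where the control characters sit in the line
codes[hidden_pos]   # their byte values; 9 would be a tab, 13 a carriage return
```

For the real file, the same check could be run on the bytes returned by readBin (with what = "raw") instead of on charToRaw of a sample string.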
The following line does not solve the problem:
my.data <- read.csv('invisible.delimiter2.csv', sep=',', header = TRUE)
In my experience, tabs are a fairly common hidden character or delimiter. However, I have tried searching for and replacing tabs, and that does not help.
I have also tried converting the *.csv file to a *.txt file, but that returns the following:
> my.data3 <- read.table('invisible.delimiter2.txt', sep=',', header = TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 4 elements
> my.data3
Error: object 'my.data3' not found
I am not familiar with other possible solutions. The file is too large to manually search every space for a hidden character and remove it.
Thank you for any advice on how to read a file like this or on how to find and remove hidden characters prior to reading the file into R.
If it helps, I originally obtained the data by copying a table from Wikipedia. Perhaps that might help identify the hidden character.
EDIT
Thanks to comments below, I opened the example data file using gVim 7.3. That software displays the hidden character and reveals it to be ^M. Unfortunately, I have not been able to remove that character from the data file with a simple find and replace within gVim 7.3. If and when I figure out how to remove the ^M, I will post the approach here.
Here is a post on how to remove ^M with Perl:
In Perl, how to remove ^M from a file?
Hopefully I can figure out how to remove it with R or a text editor.
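Since ^M is the usual editor notation for a carriage return (byte 0x0D, written "\r" in R), one possible R-only approach is to read the file as raw bytes, drop every carriage return, and parse what is left. The sketch below is untested against the actual file. It builds a small hypothetical stand-in (`demo_path`, `demo_text`) for invisible.delimiter2.csv and assumes that records end in a line feed, so deleting all carriage returns leaves the real line breaks intact.

```r
# Hypothetical stand-in for 'invisible.delimiter2.csv': stray "\r" inside records,
# and "\r\n" at the end of each record (an assumption about the real file).
demo_path <- tempfile(fileext = ".csv")
demo_text <- paste0(
  "Common.name, Scientific.name, Stuff1, Stuff2\r\n",
  "Greylag.Goose, Anser.anser, \rAAC, rr\r\n",
  "Snow.Goose, \rAnser.caerulescens, \rAAC, rr\r\n",
  "Greater.Canada.Goose, Branta.canadensis, AAC, rr\r\n"
)
writeBin(charToRaw(demo_text), demo_path)

# Read the whole file as raw bytes so that R does not interpret "\r" as a line end.
raw_bytes <- readBin(demo_path, what = "raw", n = file.info(demo_path)$size)

# Drop every carriage-return byte (13 decimal, 0x0D); the "\n" bytes remain.
clean_bytes <- raw_bytes[as.integer(raw_bytes) != 13L]

# Parse the cleaned text; strip.white = TRUE trims the blank after each comma.
clean_data <- read.csv(text = rawToChar(clean_bytes), header = TRUE,
                       strip.white = TRUE, stringsAsFactors = FALSE)
clean_data
```

If the cleaned version should be kept, writeBin(clean_bytes, ...) can save it to a new file. This would not work for a file whose only line terminator is a bare carriage return, because removing those bytes would merge all records into one line.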
Here is a link where the example *.csv file is stored.
and an alternative link to the same file on the same site: