2

I am attempting unsuccessfully to read a *.csv file containing hidden or invisible characters. The file contents are shown here:

my.data2 <- read.table(text = '
Common.name, Scientific.name, Stuff1, Stuff2
Greylag.Goose, Anser.anser, AAC, rr
Snow.Goose, Anser.caerulescens, AAC, rr
Greater.Canada.Goose, Branta.canadensis, AAC, rr
Barnacle.Goose, Branta.leucopsis, AAC, rr
Brent.Goose, Branta.bernicla, AAC, rr
', header = TRUE, sep=',', stringsAsFactors = FALSE)

Note that the above read.table command reads the data correctly. However, read.csv cannot read the file correctly because in many lines there is a hidden character following the second blank space. In some lines there is also a hidden character after the first blank space. In some lines there are no hidden characters. For example:

setwd('c:/users/mmiller21/simple R programs')

my.data <- read.csv('invisible.delimiter2.csv', header = TRUE)
my.data

returns:

            Common.name    Scientific.name Stuff1 Stuff2
1         Greylag.Goose        Anser.anser              
2                   AAC                 rr              
3            Snow.Goose                                 
4    Anser.caerulescens                                 
5                   AAC                 rr              
6  Greater.Canada.Goose  Branta.canadensis    AAC     rr
7        Barnacle.Goose   Branta.leucopsis              
8                   AAC                 rr              
9           Brent.Goose    Branta.bernicla              
10                  AAC                 rr              

More specifically, if I open the *.csv file in Notepad and use the right-arrow key to move the cursor along the first line of data I have to press the right-arrow key twice to move past the first A in AAC.

The following line does not solve the problem:

my.data <- read.csv('invisible.delimiter2.csv', sep=',', header = TRUE)

In my experience tabs are a fairly common hidden character or delimiter. However, I have tried searching for and replacing tabs and that does not help.

I have also tried converting the *.csv file to a *.txt file, but that returns the following:

> my.data3 <- read.table('invisible.delimiter2.txt', sep=',', header = TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 4 elements
> my.data3
Error: object 'my.data3' not found

I am not familiar with other possible solutions. The file is too large to manually search every space for a hidden character and remove it.

Thank you for any advice on how to read a file like this or on how to find and remove hidden characters prior to reading the file into R.

If it helps, I originally obtained the data by copying a table from Wikipedia. Perhaps that might help identify the hidden character.

EDIT

Thanks to comments below I opened the example data file using gVim 7.3. That software displays the hidden character and reveals it to be ^M. Unfortunately, I have not been able to remove that character from the data file with a simple find and replace within gVim 7.3. If and when I figure out how to remove the ^M I will post the approach here.

Here is a post on how to remove ^M with Perl.

In Perl, how to remove ^M from a file?

Hopefully I can figure out how to remove it with R or a text editor

Here is a link where the example *.csv file is stored.

https://github.com/markwmiller/Rcode/blob/93d07bd2e389e516b6da92017e025a1e97173db0/invisible.delimiter2.csv

and an alternative link to the same file on the same site:

https://github.com/markwmiller/Rcode

Community
  • 1
  • 1
Mark Miller
  • 12,483
  • 23
  • 78
  • 132
  • Can you upload the first 10-20 rows in a text file so we can play with it? – Brandon Bertelsen Jan 23 '14 at 23:10
  • I am not sure how to upload files. If I find out I will upload some data. – Mark Miller Jan 23 '14 at 23:13
  • You have to use a third party site. Like pastebin or something of that nature. – Brandon Bertelsen Jan 23 '14 at 23:14
  • Hang on - you say `read.table` reads the data okay but `read.csv` doesn't? So umm I think there's a solution here... not sure what it might be... hmmm oh yeah, "just use `read.table`"? – Spacedman Jan 23 '14 at 23:43
  • @Spacedman Perhaps I did not express myself well. When I wrote `the above read.table command reads the data correctly` I meant if one copies and pastes the quoted code into R, R will read the data correctly, but if one attempts to read the external file an error results. Perhaps your suggestion is to copy and paste the entire contents of the large data file directly into R and try reading data that way. That might work, but it seems like a stop-gap solution at best. Hopefully someone might be able to provide a way to efficiently remove hidden characters from a file regardless of file size. – Mark Miller Jan 24 '14 at 00:47
  • @Spacedman I just remembered that I had to extensively edit the data before the above `read.table` worked correctly. In other words, pasting a large chuck of data directly into R and wrapping it with `read.table` is not a viable solution. Nevertheless, thank you for the suggestion. – Mark Miller Jan 24 '14 at 00:58
  • Open your file in some other text editor (not notepad) in which you can identify the mystery character. Or upload it somewhere on the internets so that someone else here might do so. – desired login Jan 24 '14 at 01:04
  • Could the hidden characters be carriage returns? That could explain why there are additional rows being added – JeremyS Jan 24 '14 at 01:11
  • I think you don't really know what "hidden character" means. Start by taking a look at your CSV file in any text editor which can indicate or display tabs and spaces. It's quite possible there are double-tabs in there. – Carl Witthoft Jan 24 '14 at 01:19
  • @CarlWitthoft I will attempt to do so. Any suggestions regarding a free text editor to try? – Mark Miller Jan 24 '14 at 01:28
  • If there are no commas in your text values (as opposed to separator commas) then you can probably read the whole thing in using `scan`, separate on commas using `strsplit`, and reform into a tabular format using `matrix` and `as.data.frame`. Assign types (number, char, factor) to columns and job done. There may be an easier way but unless we get a sample of your file I don't think anyone is going to try... – Spacedman Jan 24 '14 at 08:11
  • @Spacedman If I figure out a way to upload a file I will do so. So far the only suggestion has been pastebin, which seems to want me to paste text into a box for uploading instead of uploading a file from my computer. Pasting text into the box does not seem to preserve the ^M but does destroy the formatting. – Mark Miller Jan 24 '14 at 08:53
  • Share it on a dropbox account? Upload to a github repository? – Spacedman Jan 24 '14 at 08:56
  • @Spacedman I have added a link to the post where people might be able to download the example *.csv file. I am not sure if I have done it correctly, so the file might not be accessible. – Mark Miller Jan 24 '14 at 09:52

3 Answers3

3

In gVim you should be able to remove the ^M characters by typing the following:

:%s/<ctrl>V<ctrl>M//g<return>

If you've typed it in correctly it will look like ':%s/^M//g' in gVim. When you press return, gVim searches (the 's') for what's between the first and second slash and replaces it with what's between the second and third slash, globally (the 'g').

NOTE: If you are on a Windows box and <ctrl>V seems to be pasting text, then gVim may be configured with 'windows behavior'. In that case, use <ctrl>Q<ctrl>M instead of <ctrl>V<ctrl>M.

When I load your sample file into gVim 7.3, it looks like this:

enter image description here

After typing the characters

:%s/<ctrl>V<ctrl>M//g

but BEFORE hitting return I see this:

enter image description here

After hitting return I see this:

enter image description here

You can then do File->Save or File->Save As, which do what you would expect.

John Black
  • 621
  • 4
  • 9
  • I am unclear where I type this. Must I use the menus to open a specific window to type this after importing the data file into gVim? – Mark Miller Jan 27 '14 at 23:18
  • Once you have the file loaded into gVim, make sure you are in command mode (hit escape a couple times if you aren't sure). Then type the character sequence above, starting with the colon, percent, lower-case 's', etc. Press return at the end, and gVim will remove the ^M characters. BTW, hitting VM in sequence is deliberate. I'ts how you enter a control character in gVim. – John Black Jan 28 '14 at 16:49
  • Thank you. I think I am almost there. My current problem is that hitting v pastes in the most recent text string I copied from somewhere ~12 hours ago. I typically copy text outside gVim by hitting c and then paste it with v. How can I eliminate that text string from memory so gVim does not search my file for an unrelated string? – Mark Miller Jan 28 '14 at 18:36
  • That's odd. In gVim, if you are in command mode (hit escape to be sure) V shouldn't paste anything: it should insert a caret symbol into the command you are building. When you hit M you should have the first part of the command from my original post. What version of gVim do you have? Check under Help->Version or Help->About. – John Black Jan 28 '14 at 18:46
  • It says `VIM - Vi IMproved 7.3 (2010 Aug 15, compiled Oct 27 2010 17:59:02) MS-Windows 32-bit GUI version with OLE support Included patches: 1-46` ... `Big version with GUI.` with a lot of additional information. I am using a 64-bit machine. I wonder whether that might cause problems. – Mark Miller Jan 28 '14 at 18:56
  • I don't think the 64-bit vs 32-bit issue is a problem. I've added some screen shots above. Maybe they will help. – John Black Jan 28 '14 at 19:26
  • Thank you. With me the cursor is in the upper left instead of the lower right on the first screenshot. Then after hitting the `esc` key I encounter the problem of text in memory when I hit `v`. I might post a new question with `gVim` as a tag asking how to eliminate the string in memory. – Mark Miller Jan 28 '14 at 19:35
  • 1
    You might have a problem in your _vimrc file. See [here](http://stackoverflow.com/questions/426896/vim-ctrl-v-conflict-with-windows-paste) – John Black Jan 28 '14 at 19:39
  • Thank you. The solution posted there was to use `q` instead of `v`. That solved the problem and removed the `^M` from my file. This alternative might be worth adding to your post. Thanks again for the help. – Mark Miller Jan 28 '14 at 19:59
1

Here's a solution using scan to read the data, matrix to structure it, and data.frame to make it into a data frame:

readF <- function(path, nfields=4){    
    m = matrix(
          gsub(",","",scan(path,what=rep("",nfields))),
              ncol=nfields,byrow=TRUE)
    d = data.frame(m[-1,])
    names(d)=m[1,]
    d
}

So first check the file duplicates your problem :

> read.csv("./invisible.delimiter2.csv")
            Common.name    Scientific.name Stuff1 Stuff2
1         Greylag.Goose        Anser.anser              
2                   AAC                 rr              
3            Snow.Goose                                 
4    Anser.caerulescens                                 
5                   AAC                 rr              
6  Greater.Canada.Goose  Branta.canadensis    AAC     rr
7        Barnacle.Goose   Branta.leucopsis              
8                   AAC                 rr              
9           Brent.Goose    Branta.bernicla              
10                  AAC                 rr        

and then see if my function solves it:

> readF("./invisible.delimiter2.csv")
Read 24 items
           Common.name    Scientific.name Stuff1 Stuff2
1        Greylag.Goose        Anser.anser    AAC     rr
2           Snow.Goose Anser.caerulescens    AAC     rr
3 Greater.Canada.Goose  Branta.canadensis    AAC     rr
4       Barnacle.Goose   Branta.leucopsis    AAC     rr
5          Brent.Goose    Branta.bernicla    AAC     rr

Feel free to pick the function apart to see how it works.

I suspect the source of the problem is that the ^M is in the field data, and because you're fields aren't quoted then R can't tell if its a real line end or one in a field. There's some notes about embedded newlines in quoted fields in the documentation for read.csv etc.

Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • Thank you for the function. I am attempting to get the function to work when there is an empty space instead of a dot in the common name and the scientific name and when `Stuff1` and `Stuff2` contain multiple words, and there is a `^M` at the beginning of the common name and/or at the end of the scientific name and/or at the end of `Stuff2`. If I can modify the function for those scenarios I might post it here. – Mark Miller Jan 28 '14 at 01:23
1

Here is code that can handle white space (i.e., multiple words) within fields:

nfields <- 4

bb <- readLines('c:/users/mmiller21/simple R programs/invisible.delimiter4.csv')
bb

pattern <- "(?<=\\,)(?=)"                  # split on commas
cc <- strsplit(bb, pattern, perl=TRUE)
dd <- unlist(cc)
ee <- dd[dd != ' ' & dd != '' & dd != ','] # remove empty elements
ff <- gsub(",", "", ee)                    # remove commas

m = matrix(ff, ncol=nfields, byrow=TRUE)   # store data in matrix

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
nn <- trim(m)
nn

Here are the contents of the original data set:

Common.name, Scientific.name, Stuff1, Stuff2
Greylag Goose, Anser anser, AAC aa, rr bb
Snow Goose, Anser caerulescens, AAC aa aa, rr bb bb
Greater Canada Goose, Branta canadensis, AAC, rr bb
Barnacle Goose, Branta leucopsis, AAC aa, rr
Brent Goose, Branta bernicla, AAC, rr bb bb bb

I simple removed the dots from the common name and scientific name and added extra text to the third and fourth columns.

Here is the output:

     [,1]                   [,2]                 [,3]        [,4]         
[1,] "Common.name"          "Scientific.name"    "Stuff1"    "Stuff2"     
[2,] "Greylag Goose"        "Anser anser"        "AAC aa"    "rr bb"      
[3,] "Snow Goose"           "Anser caerulescens" "AAC aa aa" "rr bb bb"   
[4,] "Greater Canada Goose" "Branta canadensis"  "AAC"       "rr bb"      
[5,] "Barnacle Goose"       "Branta leucopsis"   "AAC aa"    "rr"         
[6,] "Brent Goose"          "Branta bernicla"    "AAC"       "rr bb bb bb"
Mark Miller
  • 12,483
  • 23
  • 78
  • 132