1

I am trying to import a CSV file into R using the read.table command. I keep getting the error message "more columns than column names", even though I have set the strip.white to TRUE. The program that makes the csv files adds a large number of comma characters to the end of each line, which I think is the source of the extra columns.

read.table("filename.csv", sep=",", fill=T, header=TRUE, strip.white = T, 
           as.is=T,row.names = NULL, quote = "")

How can I get R to strip away the extraneous columns of commas from the header line and from the rest of the CSV file as it reads it into the R console?

Also, numerous cells in the csv file do not contain any data. Is it possible to get R to fill in these empty cells with "NA"?

The first two lines of the csv file:

Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Charles
  • 50,943
  • 13
  • 104
  • 142
gwilymh
  • 415
  • 1
  • 7
  • 20
  • May I suggest using `read.csv` as it's a wrapper for `read.able` that is preset to deal with csv files specifically. – Tyler Rinker Dec 21 '12 at 20:03
  • 2
    Missing values will be populated with NA by default. If you can't fix the generation of this badly-formed file, you may have to read it with `header=FALSE, skip=1` and supply column names after the fact. – Matthew Lundberg Dec 21 '12 at 20:15
  • And, after doing what @MatthewLundberg suggests, drop the bogus all-`NA` columns (i.e., `mydat <- mydat[!sapply(mydat,function(x) all(is.na(x)))]`). If the CSV creator is generating *different* numbers of bogus fields on different lines, you may be in trouble ... – Ben Bolker Dec 21 '12 at 20:19
  • @MatthewLundberg: NA's are only assigned automatically for blanks for logical, integer, numeric and complex columns unless the `na.strings` parameter is specified differently. – IRTFM Dec 22 '12 at 04:54
  • @DWin: Important distinction. Thanks. – Matthew Lundberg Dec 22 '12 at 04:59

2 Answers2

4

You can use a combination of colClasses with "NULL" entries to "blank-out" the commas (also still needing , fill=TRUE:

read.table(text="1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,
 9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", sep=",", fill=TRUE, colClasses=c(rep("numeric", 8), rep("NULL", 30)) )
#------------------
  V1 V2 V3 V4 V5 V6 V7 V8
1  1  2  3  4  5  6  7  8
2  9  9  9  9  9  9  9  9
Warning message:
In read.table(text = "1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,\n9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,",  :
  cols = 26 != length(data) = 38

I needed to add back in the missing linefeed at the end of the first line. (Yet another reason why you should edit questions rather than putting data examples in the comments.) There was an octothorpe in the header which required the comment.char be set to "":

read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\nChr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",")

  Document_Name Sequence_Name Track_Name Type           Name
1       Chr2_FT          Chr2   Chr2.bed  CDS 10000_ARHGAP15
                                                        Sequence  Minimum Min_.with_gaps...  Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421          56019336 55916483
  Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1        56019399     63                 64            1   forward                                     U‌​ser
  Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1                                                                                                    
  Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1                                                      

If you know what your colClasses will be, then you can get missing values to be NA in the numeric columns automatically. You could also use the na.strings setting to accomplish this. You could also do some editing on the header to take out the illegal characters in the column names. (I didn't think I needed to be the one to do that though.)

read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
 Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",", na.strings="")
#------------------------------------------------------
  Document_Name Sequence_Name Track_Name Type           Name
1       Chr2_FT          Chr2   Chr2.bed  CDS 10000_ARHGAP15
                                                        Sequence  Minimum Min_.with_gaps...  Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421          56019336 55916483
  Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1        56019399     63                 64            1   forward              <NA>     <NA>          U‌​ser
  Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1              <NA>        <NA>                       <NA>                     <NA>              <NA>
  Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1                  <NA>                            <NA>
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Thanks for your suggestions everyone, but is there a way to simply strip the empty white spaces off the end of the header line, as the strip.white functions does not seem to work? I want to write a general script that I can readily apply to a number of different csv output files, which will have variable column names and numbers. – gwilymh Dec 22 '12 at 01:22
  • 1
    You question is not specific enough. It is easy to "strip empty white spaces off the end of the header line" if we are told what the line looks like and how we are supposed to determine what is wheat and what is chaff. Please post a range of examples and then we can offer specific suggestions. – IRTFM Dec 22 '12 at 01:31
  • Sorry to be vague DWin. Here are the first two lines of the csv file. For the life of me, I cannot work out how to take out the extra commas at the end of the Header line. – gwilymh Dec 22 '12 at 02:46
  • Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-value,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Value_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCAATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,User,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, – gwilymh Dec 22 '12 at 02:49
  • One of the problems is that you have a # sign (the dreaded octothorpe) in you header. I will see what I can do. – IRTFM Dec 22 '12 at 04:35
2

I have been fiddling with the first two lines of your file, and the problem appears to be the # in one of your column names. read.table treats # as a comment character by default, so it reads in your header, ignores everything after # and returns 13 columns.

You will be able to read in your file with read.table using the argument comment.char="".

Incidentally, this is yet another reason why those who ask questions should include examples of the files/datasets they are working with.

Community
  • 1
  • 1
Blue Magister
  • 13,044
  • 5
  • 38
  • 56