I am new to R and am having a lot of trouble just reading in a .csv file and converting it into a data.frame with 7 columns. Here is what I am doing:

gene_symbols_table <- as.data.frame(read.csv(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=TRUE, sep=","))

After that I get a data.frame with dim = 46761 x 1, but I need it to be 46761 x 7. I tried the suggestions in the following Stack Overflow threads:

  1. How can you read a CSV file in R with different number of columns

  2. read.delim() - errors "more columns than column names" and "header and ''col.names" are of different lengths"

  3. Split a column of a data frame to multiple columns

But somehow nothing is working in my case. Here is how the table looks:

> head(gene_symbols_table, 3)
  input.reason.matches.organism.name.primaryIdentifier.symbol.briefDescription.class.secondaryIdentifier
1 WBGene00008675 MATCH 1 Caenorhabditis elegans WBGene00008675 irld-26  Gene F11A5.7
2 WBGene00008676 MATCH 1 Caenorhabditis elegans WBGene00008676 oac-15  Gene F11A5.8
3 WBGene00008677 MATCH 1 Caenorhabditis elegans WBGene00008677   Gene F11A5.9

The .csv file in Excel looks like this:

input | reason | matches | organism.name | primaryIdentifier | symbol | briefDescription
WBGene00008675 | MATCH | 1 | Caenorhabditis elegans | WBGene00008675 | irld-26 | ...
...

The following code:

gene_symbols_table <- read.table(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=FALSE, sep=",",
                                 col.names = paste0("V", seq_len(7)), fill = TRUE)

This seems to work, but when I check dim I can see right away that the result is wrong: 20124 x 7. The first rows look like this:

  V1
1 input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier
2 WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7
3 WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8
  V2 V3 V4 V5
1
2
3

So the whole row still ends up in the first column, and the result is wrong.

Other attempts with read.table give me the errors described in the second Stack Overflow thread.

I have also tried splitting the one-column data.frame into 7 columns, but with no success so far.

Nikita Vlasenko
  • What happens when you change `sep=','` to `sep=';'`? – Nate Dec 13 '17 at 21:54
  • `more columns than column names` error – Nikita Vlasenko Dec 13 '17 at 21:56
  • I think you'll need to include more lines of the file (as displayed in a text editor, not Excel) in order to get help. Your Excel snippet suggests you might need a `sep = "|"` argument but this remains unclear. Also, the response from `read.csv()` is a data frame, so you don't need `as.data.frame()`. – Thomas Dec 13 '17 at 21:57
  • I added '|' myself here for the sake of visualizing it better. In Excel these are just cells – Nikita Vlasenko Dec 13 '17 at 21:58
  • @NikitaVlasenko Do you have any way of knowing if your data is 'ragged', meaning that some rows could have more or less than 7 columns? Another reason you could have that error is if you have an index column in your data without a column name. – Nate Dec 13 '17 at 21:59
  • The data is fine for sure. I just noticed that when I open the file with LibreOffice I see `Separated by: semicolon, space`. – Nikita Vlasenko Dec 13 '17 at 22:02
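
The two open questions in these comments (which separator the file really uses, and whether any rows are ragged) can be checked directly on the raw text. The following is a minimal base R sketch, not a definitive diagnosis: `sample_lines` is a hypothetical three-line stand-in copied from the output shown in the question, and `tmp` is a temporary file used only so the example runs on its own. On the real data you would pass the actual file path instead.

```r
# Hypothetical stand-in for the real file: three lines copied from the question.
sample_lines <- c(
  "input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier",
  "WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7",
  "WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Look at the raw text instead of a spreadsheet rendering of it.
raw_head <- readLines(tmp, n = 3)
print(raw_head)

# Count the fields on each line for a candidate separator.
# quote = "" and comment.char = "" keep quote and '#' characters
# from being interpreted, so every line is split on ';' only.
n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

# A single value in this table means every line has the same number of fields;
# several values mean the file is ragged for this separator.
print(table(n_fields))
```

If the table shows more than one field count on the full file, `which(n_fields != n_fields[1])` gives the line numbers that differ from the header.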

1 Answer

Judging by what the table looks like, the separator seems to be a space or a semicolon, not a comma. So either try specifying that in `sep`, or try `fread` from the data.table package, which detects the separator automatically.

library(data.table)  # provides fread()

gene_symbols_table <- as.data.frame(fread(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=TRUE))
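
If you would rather stay in base R, specifying the separator yourself could look like the sketch below. It assumes the file really is semicolon-delimited with nine fields per row, as the header printed in the question suggests; `sample_text` is a hypothetical in-memory stand-in for the file, built from the rows shown there. The asker reports a `more columns than column names` error with `sep=';'` on the full file, so some rows may not match this sample, and this is only a starting point.

```r
# Hypothetical in-memory stand-in for the file, using rows shown in the question.
sample_text <- paste(
  "input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier",
  "WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7",
  "WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8",
  sep = "\n"
)

# read.csv() already returns a data frame, so as.data.frame() is not needed.
# quote = "" treats quote characters as ordinary text instead of field delimiters.
gene_symbols_table <- read.csv(
  text = sample_text,        # on the real data, use file = "<path to the csv>" instead
  header = TRUE,
  sep = ";",
  quote = "",
  stringsAsFactors = FALSE
)

print(dim(gene_symbols_table))
print(names(gene_symbols_table))
```

`read.csv2()` also defaults to `sep = ";"`, but it additionally sets `dec = ","`, so it is only the right choice if decimal commas are expected in numeric columns.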
phil_t