R: Read in .csv file and convert into multiple column data frame

Question

I am new to R and am having a lot of trouble just reading in a .csv file and converting it into a data.frame with 7 columns. Here is what I am doing:

gene_symbols_table <- as.data.frame(read.csv(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=TRUE, sep=","))

After that I am getting a data.frame with dim = 46761 x 1, but I need it to be 46761 x 7. I tried the following stackoverflow threads:

But somehow nothing is working in my case. Here is how the table looks:

> head(gene_symbols_table, 3)
  input.reason.matches.organism.name.primaryIdentifier.symbol.briefDescription.class.secondaryIdentifier
1 WBGene00008675 MATCH 1 Caenorhabditis elegans WBGene00008675 irld-26  Gene F11A5.7
2 WBGene00008676 MATCH 1 Caenorhabditis elegans WBGene00008676 oac-15  Gene F11A5.8
3 WBGene00008677 MATCH 1 Caenorhabditis elegans WBGene00008677   Gene F11A5.9

The .csv file in Excel looks like this:

input   |  reason   |  matches  |   organism.name  |    primaryIdentifier   |  symbol   |  briefDescription
WBGene00008675  |   MATCH  |    1     |   Caenorhabditis elegans   |   WBGene00008675  |   irld-26   |   ...
...

The following code:

gene_symbols_table <- read.table(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=FALSE, sep=",",
col.names = paste0("V",seq_len(7)), fill = TRUE)

This seems to work, but dim shows right away that the result is wrong: 20124 x 7. The output looks like this:

  V1
1 input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier
2 WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7
3 WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8
  V2 V3 V4 V5
1
2
3

So this result is wrong as well.

Other attempts at read.table are giving me the error specified in the second stackoverflow thread.

I have also tried splitting the data.frame with one column into 7, but so far no success.

I think you'll need to include more lines of the file (as displayed in a text editor, not Excel) in order to get help. Your Excel snippet suggests you might need a `sep = "|"` argument but this remains unclear. Also, the response from `read.csv()` is a data frame, so you don't need `as.data.frame()`. — Thomas, Dec 13 '17 at 21:57
I added '|' myself here for the sake of visualizing it better. In Excel these are just cells — Nikita Vlasenko, Dec 13 '17 at 21:58
@NikitaVlasenko Do you have any way of knowing if your data is 'ragged', meaning that some rows could have more or less than 7 columns? Another reason you could have that error is if you have an index column in your data without a column name. — Nate, Dec 13 '17 at 21:59
The data is fine for sure. I just noticed that when I open the file with LibreOffice I see `Separated by: semicolon, space`. — Nikita Vlasenko, Dec 13 '17 at 22:02

score 4 · Accepted Answer · answered Dec 13 '17 at 22:00

Judging from how the table looks, the separator is a semicolon (or possibly a space), not a comma. Either specify that explicitly via sep, or try fread from the data.table package, which detects the separator automatically.

library(data.table)  # provides fread
gene_symbols_table <- as.data.frame(fread(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=TRUE))
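
For the other option, specifying the separator yourself, here is a minimal base R sketch. It assumes the file really is semicolon-delimited, as the output in the question suggests. The sample rows are reconstructed from the question for illustration only, and `csv_path` is a hypothetical temporary file standing in for the real path.

```r
# Hypothetical sample data, reconstructed from the rows shown in the question.
sample_lines <- c(
  "input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier",
  "WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7",
  "WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8",
  "WBGene00008677;MATCH;1;Caenorhabditis elegans;WBGene00008677;;;Gene;F11A5.9"
)
csv_path <- tempfile(fileext = ".csv")  # stand-in for the real file path
writeLines(sample_lines, csv_path)

# Step 1: look at the raw text to see which delimiter the file actually uses.
readLines(csv_path, n = 2)

# Step 2: count fields per line to check that the rows are not ragged.
field_counts <- count.fields(csv_path, sep = ";", quote = "\"")
table(field_counts)

# Step 3: read the file with the semicolon separator given explicitly.
gene_symbols_table <- read.csv(csv_path, header = TRUE, sep = ";",
                               stringsAsFactors = FALSE)
dim(gene_symbols_table)
```

With this sample, `dim` reports 3 rows and 9 columns, since the header line shown in the question lists nine names rather than seven. `read.csv2` is another base R option: it defaults to `sep = ";"`, but it also assumes a comma as the decimal mark, so check that this suits the data before using it.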
