I have a strange error in some of the data frames I'm working on: the row names are shifted by one cell, so they no longer correspond to my data.

Instead of having this:

> head(xaa.small)
                     AGCATTCGAAACATCGAGGCTAACATCCAGTACGCAAGTGGCC AGCATTCGAAACATCGCCAGTTCAATCCATCTTCACAGTGGCC
hg19_ENSG00000000003                                           0                                           0
hg19_ENSG00000000419                                           0                                           0
hg19_ENSG00000000457                                           0                                           0

It looks like this:

> head (xab.small)
                     AGCATTCGAAACATCGAGGCTAACATCCAGTACGCAAGTGGCC AGCATTCGAAACATCGCCAGTTCAATCCATCTTCACAGTGGCC
                                                               0                                           0
hg19_ENSG00000103160                                           0                                           0
hg19_ENSG00000103168                                           0                                           0

That empty cell in place of the first row name appears in some of the data frames.

How can I remove it and "repair" my data frames in R? Or is there a better way to import them? Setting `fill = TRUE` when importing the data frame works, but it blocks the further analysis that I have to do.

The data comes from a huge TSV file that was cut into multiple parts. Maybe an error occurred during the cutting process (the `split` command was used to cut the initial TSV file into several 200 MB files).

Ondy
  • Please add your data using `dput()`. It's hard to see your exact data structure. – tmfmnk Feb 18 '20 at 08:59
  • Where does the data for the data.frames come from? I.e. how are you reading the data e.g. `read.csv`, `read.table`, `openxlsx` etc. ? It's probably going to be better to fix the import than trying to "repair" the data.frames with the weird structure... If you are reading from a text file, can you post the first few lines of said file? – dario Feb 18 '20 at 09:01
  • I can't really show the real data because 1. it's confidential 2. it's massive ... there are 48178 columns, I'll try to show a part of it. – Ondy Feb 18 '20 at 09:05
  • I used `read.table` or `read_tsv` (from the readr package). I modified the question to give more details: the data come from a huge TSV file that was cut into multiple pieces, and the headers were added afterwards. – Ondy Feb 18 '20 at 09:06
  • Here, I added the head of my data, the row names are shifted by one cell and one empty cell is created – Ondy Feb 18 '20 at 09:44
  • Please add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) that reproduces the error. You mentioned `split` was used in the last comment but don't show any code... If you create a MRE that reproduces the error you can help others to help you! – dario Feb 18 '20 at 10:09
  • Okay, I'll create one :) Thank you for the suggestion! – Ondy Feb 18 '20 at 10:16

1 Answer

I identified the error:

When the initial file was cut into several pieces, it was split by bytes and not by lines. So a cut sometimes fell in the middle of a line (for example, partway through the last line of a piece), which produced incomplete lines or an empty cell.
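
To illustrate the mechanism, here is a minimal sketch on made-up data (the file names, the `gene1`..`gene3` rows, and the 16-byte chunk size are assumptions for the demo, not the real files). It splits a tiny tab-separated file by bytes and counts the fields on each line of each piece. The second piece begins with a line that has lost its row name.

```shell
# Build a small tab-separated file in a temporary directory (made-up data).
tmp=$(mktemp -d)
printf 'gene1\t0\t0\ngene2\t0\t0\ngene3\t0\t0\n' > "$tmp/demo.tsv"

# Split every 16 bytes: the boundary ignores where lines end.
split -b 16 "$tmp/demo.tsv" "$tmp/bytes_"

# Print the number of tab-separated fields on each line of each piece.
for f in "$tmp"/bytes_*; do
  echo "== $(basename "$f")"
  awk -F'\t' '{ print NF " field(s): " $0 }' "$f"
done
```

A line with fewer fields than its neighbours at the start or end of a piece is a sign that the cut fell inside a row.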

I corrected the error simply by using `split -l` (number of lines) on the file instead of `split -b` (number of bytes).
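
For reference, a rough sketch of a line-based split on made-up data. The file names, the header `id A B`, and the 2-lines-per-piece size are assumptions for the demo; the real file would need a much larger line count. It also re-adds the header to each piece, since the pieces otherwise lose it.

```shell
# Build a small tab-separated file with a header line (made-up data).
tmp=$(mktemp -d)
printf 'id\tA\tB\ngene1\t0\t0\ngene2\t0\t0\ngene3\t0\t0\ngene4\t0\t0\n' > "$tmp/big.tsv"

# Keep the header aside and split only the data rows, 2 lines per piece.
head -n 1 "$tmp/big.tsv" > "$tmp/header.tsv"
tail -n +2 "$tmp/big.tsv" | split -l 2 - "$tmp/part_"

# Prepend the header to every piece so each one can be read on its own.
for f in "$tmp"/part_*; do
  cat "$tmp/header.tsv" "$f" > "$f.tsv"
  rm "$f"
done

# Print the field count of every line; all lines should report the same number.
awk -F'\t' '{ print FILENAME ": " NF }' "$tmp"/part_*.tsv
```

Because `split -l` only cuts at line ends, no row is ever divided between two pieces.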

Ondy