I am using R to analyze a data set produced by a colleague's Python script. The script returns the structure below, where 13 is the number of samples and 3128 is the number of trait observations. Each trait is coded as a single digit, so every digit after the sample name represents one column, and its value is the coding for that trait:

13 3128
>1062_0    0000000000[...]
>1066A_0    000001010[...]
>1067A_0    000002010[...]
>1067B_0    110013010[...]
>1067C_0    000024010[...]
>1067D_0    000024010[...]
>1084A_0    200100010[...]
>1084B_0    001005110[...]
>1084C_0    000000010[...]
>1086_0    0100002100[...]
>1087_0    3002040100[...]
>1088_0    0000060111[...]
>C105_0    0000050120[...]

I am trying to get these data into a data frame with 13 rows and 3,128 columns.

I have used the read.phylip function of phylotools to read in this file above and can get it into a data.frame:

library(phylotools)  # provides read.phylip() and phy2dat()
SL_FFR_input <- read.phylip(fil = "matrix.phy")
SL_FFR_frame <- phy2dat(SL_FFR_input)

However, this results in a data frame of two columns, V1 being the sample names, and V2 being a string of all of the single digit codings.

The frame that would be useful is shown below, where the sample names form the row names and each value now has its own column.

>1062_0     0 0 0 0 0 0 0 0 0[...]
>1066A_0    0 0 0 0 0 1 0 1 0[...]
>1067A_0    0 0 0 0 0 2 0 1 0[...]
>1067B_0    1 1 0 0 1 3 0 1 0[...]
>1067C_0    0 0 0 0 2 4 0 1 0[...]
>1067D_0    0 0 0 0 2 4 0 1 0[...]
>1084A_0    2 0 0 1 0 0 0 1 0[...]
>1084B_0    0 0 1 0 0 5 1 1 0[...]
>1084C_0    0 0 0 0 0 0 0 1 0[...]
>1086_0     0 1 0 0 0 0 2 1 0[...]
>1087_0     3 0 0 2 0 4 0 1 0[...]
>1088_0     0 0 0 0 0 6 0 1 1[...]
>C105_0     0 0 0 0 0 5 0 1 2[...] 

It would be a huge help if someone could point me in the right direction!

BTS
  • Do you have control of or access to the Python script? Does it use a `pandas` data frame? Working from the source may help. Also, do you really want 3,128 columns? If they were originally observations, keep them that way, with the 13 samples as columns. In most data structures, columns are more expensive than rows in memory, disk space, and processing, and they require extra management such as naming and data types. – Parfait Oct 11 '15 at 19:25
  • this answer may help: http://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame – bjoseph Oct 11 '15 at 19:29
  • Parfait: I do have access to the Python script, but I do not have expertise in the language. I suppose I could ask for the output as a .csv file or some other delimited format, and then constructing my frame would be very straightforward. Also, I really do want to keep all 3,128 columns. Each contains information from one DNA region. – BTS Oct 11 '15 at 19:43

1 Answer

I recommend dplyr + tidyr. It's possible to do this with strsplit and rbind, but it's ugly.

library(dplyr)
library(tidyr)
df1 <- data.frame(snames = c('a','b','c'),
                  digits = c('0000000000000',
                             '0000100000000',
                             '0000000001000'))
result <- df1 %>% separate(digits, paste0('X',1:13),sep = 1:12)

This separates the string after each of character positions 1 through 12 and names the resulting columns X1 through X13.

EDIT: for your case, change the 13 to 3128 and the 12 to 3127, and replace "digits" with the name of your column.
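
For reference, here is a rough base R sketch of the strsplit + rbind route mentioned above. It also uses the sample names as row names, as the question asks. The toy data frame is a hypothetical stand-in for the real one; the column names V1 and V2 are taken from the question's description, and the X1, X2, ... column names just follow the naming used above. It assumes every sample has the same number of codes and that every code is a digit (anything else, such as "?" or "-", would become NA with a warning on conversion).

```r
# Toy stand-in for the two-column frame described in the question
# (V1 = sample names, V2 = one string of single-digit codes per sample).
frame <- data.frame(V1 = c("1062_0", "1066A_0", "1067B_0"),
                    V2 = c("000000000", "000001010", "110013010"),
                    stringsAsFactors = FALSE)

# Split each string into single characters; as.character() guards against factors.
chars <- strsplit(as.character(frame$V2), split = "")

# All samples must have the same number of codes for rbind to line up.
stopifnot(length(unique(lengths(chars))) == 1)

# Stack the per-sample vectors into a character matrix, then convert to integer.
mat <- do.call(rbind, chars)
storage.mode(mat) <- "integer"

# Build the wide data frame: one row per sample, one column per trait.
wide <- as.data.frame(mat)
rownames(wide) <- frame$V1
colnames(wide) <- paste0("X", seq_len(ncol(wide)))
```

Because the number of columns is taken from the data, nothing needs to be changed by hand for 3,128 traits; only `frame` would be replaced by the real data frame.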

Shape