The best way to mark (split?) dataset in each string

Question

I have a dataset of 485k strings (1.1 GB). Each string is about 700 characters long and holds about 250 variables (1-16 characters per variable), but there are no delimiters between them. The length of each variable is known. What is the best way to modify the data so that the variables are separated by the symbol `,`?

For example, I have strings like:

0123456789012...
1234567890123...

and an array of lengths: 5,3,1,4,... Then I should get this:

01234,567,8,9012,...
12345,678,9,0123,...

Could anyone help me with this? Python or R tools are preferred.

A similar option in `R` would be `read.fwf` – akrun Apr 22 '15 at 14:06

score 1 · Accepted Answer · answered Apr 22 '15 at 14:05

Pandas could load this using read_fwf:

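In [321]:

import io
import pandas as pd

t = """0123456789012..."""
# dtype=str keeps leading zeros; without it the first field is parsed as the integer 1234
pd.read_fwf(io.StringIO(t), widths=[5,3,1,4], header=None, dtype=str)
Out[321]:
       0    1  2     3
0  01234  567  8  9012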

This gives you a DataFrame in which each variable is its own column, so you can access each one individually for whatever purpose you require.
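If the goal is only to insert the commas, a 1.1 GB file can also be processed one line at a time without building a DataFrame. The following is a minimal stdlib-only sketch, not part of the original answer: the helper names `make_slices`, `mark_line` and `mark_stream` are made up for illustration, and it assumes one record per line and that every line is at least as long as the sum of the widths.

```python
import io
from itertools import accumulate


def make_slices(widths):
    """Turn field widths into slice objects using cumulative end offsets."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [slice(s, e) for s, e in zip(starts, ends)]


def mark_line(line, slices, sep=","):
    """Cut one fixed-width record into fields and join them with sep."""
    return sep.join(line[s] for s in slices)


def mark_stream(src, dst, widths, sep=","):
    """Process records one at a time so the whole file never sits in memory."""
    slices = make_slices(widths)
    for line in src:
        dst.write(mark_line(line.rstrip("\n"), slices, sep) + "\n")


# Example from the question, using only the first four widths.
widths = [5, 3, 1, 4]
src = io.StringIO("0123456789012\n1234567890123\n")
dst = io.StringIO()
mark_stream(src, dst, widths)
print(dst.getvalue())
```

With real data, `src` and `dst` would be file objects opened with `open(...)` instead of `io.StringIO`. Because the fields stay strings throughout, leading zeros are preserved.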

score 1 · Answer 2 · answered Apr 22 '15 at 14:18

In R read.fwf would work:

# inputs
x <- c("0123456789012...", "1234567890123...")
widths <- c(5,3,1,4)

read.fwf(textConnection(x), widths, colClasses = "character")

giving:

     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123

If numeric rather than character columns are desired, drop the `colClasses` argument. Note that numeric columns lose their leading zeros.

edited Apr 22 '15 at 15:53

answered Apr 22 '15 at 14:18

G. Grothendieck


score 1 · Answer 3 · answered Apr 22 '15 at 14:23


Try this in R:

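x <- "0123456789012"
y <- c(5,3,1,4)

# end and start positions of each field
ends <- cumsum(y)
starts <- ends - y + 1

# collapse (not sep) joins the pieces of one vector into a single string
output <- paste(substring(x, starts, ends), collapse = ",")
# output is "01234,567,8,9012"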

answered Apr 22 '15 at 14:23

Eric Brooks


score 0 · Answer 4 · answered Apr 22 '15 at 14:18

One option in R is

indx1 <- c(1, cumsum(len)[-length(len)]+1)
indx2 <- cumsum(len)
toString(vapply(seq_along(len), function(i)
         substr(str1, indx1[i], indx2[i]), character(1)))
#[1] "01234, 567, 8, 9012"

data

str1 <- '0123456789012'
len <- c(5,3,1,4)
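For Python users, the same start/end bookkeeping can be handed over to the standard `struct` module, which cuts a byte string into fixed-size pieces in one call. This is only a sketch under assumptions not stated in the question: the data is in a single-byte encoding such as ASCII, each line is read as `bytes`, and the name `mark_bytes` is hypothetical.

```python
import struct

widths = [5, 3, 1, 4]

# Builds the format string "5s3s1s4s": each "Ns" is a field of N raw bytes.
record = struct.Struct("".join(f"{w}s" for w in widths))


def mark_bytes(line: bytes, sep: bytes = b",") -> bytes:
    """Split one fixed-width record into fields and join them with sep."""
    # unpack_from reads the first record.size bytes and ignores anything after them.
    return sep.join(record.unpack_from(line))


result = mark_bytes(b"0123456789012")
print(result)  # b'01234,567,8,9012'
```

`unpack_from` raises `struct.error` if a line is shorter than `record.size`, which doubles as a cheap check for truncated records.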
