Break a column at regular intervals into multiple rows

Question

I have a column of numbers in a csv file and I want to break the column at regular intervals and transpose them into multiple rows. For example:

Dummy input file:

Expected output (Breaking at regular intervals of 3):

I have been trying to do this in R with a for loop but haven't succeeded: I am not getting the desired output. There are also more than 10 million points like these in a single column, so I am not sure a loop is an efficient approach. I have googled and looked at similar questions on Stack Exchange, such as "split string at regular intervals" and "How to split a string into substrings of a given length?", but they haven't solved my problem.

Nevertheless, any help with this is appreciated.

score 2 · Answer 1 · answered Nov 16 '18 at 05:20

Here is one base R option. We can pad your input vector/column with NA so that its length becomes a multiple of three. Then, generate index series for each of three columns, and create the desired data frame.

rem <- length(input) %% 3
input <- c(input, rep(NA, ifelse(rem == 0, 0, 3 - rem)))
idx1 <- seq(1, length(input), 3)
idx2 <- seq(2, length(input), 3)
idx3 <- seq(3, length(input), 3)

df <- data.frame(v1=input[idx1], v2=input[idx2], v3=input[idx3])

answered Nov 16 '18 at 05:20

Tim Biegeleisen

Not for production use, but [here is a small demo](https://rextester.com/CNWC23750) showing that the logic works. – Tim Biegeleisen Nov 16 '18 at 05:26
This works when we take the `input` file as a vector `c(1,2,..)`. However when I import a csv file containing these numbers it doesn't work. – Dark_Knight Nov 16 '18 at 05:38
@Dark_Knight Then my code would just require a slight modification. We can replace `input` with the data frame/data table column. – Tim Biegeleisen Nov 16 '18 at 05:42
`read.csv( "my_data.csv" )[ ,1 ]` would give you a vector – vaettchen Nov 16 '18 at 05:46
Elegant solution. But if `length(input)` is a very large number (millions) and the break needs to be done at intervals of magnitude thousand, then it won't be possible to generate `idx` sequences manually. – Dark_Knight Nov 16 '18 at 06:09
Yes, it would be, and a sequence in the millions should not be a problem at all in R. Billions is another story...but...do you really have that much data? – Tim Biegeleisen Nov 16 '18 at 06:10
Fortunately and unfortunately yes. I can't even open the csv file. – Dark_Knight Nov 16 '18 at 06:11
_Excel_ file? What does that have to do with R? Just read in the CSV file into R, and go from there. – Tim Biegeleisen Nov 16 '18 at 06:12
@TimBiegeleisen got it. Thanks! – Dark_Knight Nov 16 '18 at 06:13
@TimBiegeleisen What I meant to say earlier was that in place of 3 suppose I have to make the break at 1000. Then in that case do I need to manually create 1000 `idx` sequences like `idx1, idx2, ... , idx1000`? – Dark_Knight Nov 16 '18 at 06:19
@Dark_Knight Yes, for 1000 columns my solution is not so nice. – Tim Biegeleisen Nov 16 '18 at 06:27
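
The padding idea from this answer can be generalized so that no per-column `idx` sequences are needed. The following is a minimal sketch, not the answerer's code: `break_into_rows` is a hypothetical helper name, and the sample values are made up. It pads the vector with NA to a multiple of the interval and lets `matrix(..., byrow = TRUE)` do the reshaping.

```r
# Hypothetical helper (not from the thread): reshape a vector into rows of
# `width` values, padding the last row with NA where needed.
break_into_rows <- function(x, width) {
  pad <- (width - length(x) %% width) %% width  # 0 if already a multiple
  x <- c(x, rep(NA, pad))
  # byrow = TRUE fills row by row, so each row holds `width` consecutive values
  as.data.frame(matrix(x, ncol = width, byrow = TRUE))
}

input <- c(10, 25, 9, 4, 14, 100, 1, 10, 100, 4)  # made-up sample values
res <- break_into_rows(input, 3)
print(res)
```

For a CSV file, the column could first be pulled out as a vector with `read.csv("my_data.csv")[, 1]`, as suggested in the comments above, and then passed to the helper with the desired interval (for example 1000).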

Shree · Accepted Answer · 2018-11-16T15:35:28.563

Here's a dynamic tidyverse way. Should work for any breaks value.

library(dplyr)
library(tidyr)

set.seed(1)
df <- tibble(x = sample(20, 10))

breaks <- 3

df %>% 
  mutate(
    id = rep(paste0("col", 1:breaks), length.out = nrow(.)),
    rn = ave(x, id, FUN = seq_along)
  ) %>% 
  spread(id, x) %>% 
  select(-rn)

# A tibble: 4 x 3
   col1  col2  col3
  <int> <int> <int>
1     6     8    11
2    16     4    14
3    15     9    19
4     1    NA    NA

# another example with breaks at 6
breaks <- 6

df %>% 
  mutate(
    id = rep(paste0("col", 1:breaks), length.out = nrow(.)),
    rn = ave(x, id, FUN = seq_along)
  ) %>% 
  spread(id, x) %>% 
  select(-rn)

# A tibble: 2 x 6
   col1  col2  col3  col4  col5  col6
  <int> <int> <int> <int> <int> <int>
1     6     8    11    16     4    14
2    15     9    19     1    NA    NA
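
In current tidyr, `spread()` is superseded by `pivot_wider()`. The sketch below is one possible equivalent, not part of the original answer: it computes the row and column position arithmetically instead of with `ave()`, and it hard-codes the ten values from the output above (as doubles) so that it does not depend on `sample()`. The column name `pos` is introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

breaks <- 3
# The ten values shown in the output above, in their original order
df <- tibble(x = c(6, 8, 11, 16, 4, 14, 15, 9, 19, 1))

wide <- df %>%
  mutate(
    pos = row_number() - 1,                  # zero-based position in the column
    rn  = pos %/% breaks,                    # which output row the value lands in
    id  = paste0("col", pos %% breaks + 1)   # which output column it lands in
  ) %>%
  select(-pos) %>%
  pivot_wider(names_from = id, values_from = x) %>%
  select(-rn)

print(wide)
```

By default `pivot_wider()` orders the new columns by first appearance, so they come out as `col1`, `col2`, ... in that order.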

edited Nov 16 '18 at 15:35

answered Nov 16 '18 at 06:01

Shree

Thanks. It's almost working. I am encountering an error `Duplicate identifiers for rows (600, 653,...)` while working on the actual data. For small dummy data it works perfectly fine. – Dark_Knight Nov 16 '18 at 11:23
Is your breaks > 26? If so, you need to adjust the `letters[1:breaks]` to something more appropriate. Seems like you are breaking at intervals of 52. Also this question has been marked as duplicate so check out the original question for other answers. – Shree Nov 16 '18 at 13:17
Yes. Originally I am breaking at intervals of 11446. What modifications need to be made to `letters[1:breaks]`? – Dark_Knight Nov 16 '18 at 13:52
I have updated the answer to make it scalable to any breaks value. Try it and let me know. – Shree Nov 16 '18 at 15:35
It worked perfectly. Thank you. – Dark_Knight Nov 16 '18 at 16:29

score 1 · Answer 3 · edited Nov 16 '18 at 05:57

You can use the base R cut function together with dplyr.

dataframe %>%
  mutate(new_variable = cut(column,
                            breaks = quantile(column, c(0, 0.25, 0.5, 0.75, 1)),
                            labels = FALSE, include.lowest = TRUE))

or

#breaks into the intervals you require 
new_variable <- cut(as.numeric(dataset$column),breaks = 3)

Then use the melt function from the reshape package to transpose the column into rows.
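
To see what `cut()` returns here, the short sketch below uses made-up values. It suggests that `cut()` assigns bins by value range, whereas an index built from `seq_along()` groups by position, which is what breaking a column every three entries needs.

```r
x <- c(1, 2, 3, 10, 11, 12)   # made-up values

# Three equal-width bins over the range of x: grouping depends on the values
bins <- cut(x, breaks = 3, labels = FALSE)
print(bins)

# Grouping by position instead: every three consecutive entries share an index
pos_group <- ceiling(seq_along(x) / 3)
print(pos_group)
```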

edited Nov 16 '18 at 05:57

Shree

answered Nov 16 '18 at 05:17

john doe

score 1 · Answer 4 · answered Nov 16 '18 at 06:09

If your data is in the form of a vector you can do the following:

data <- c('10', '25', '09', '04', '14', '100', '01',
          '10', '100', '04', '04', '01', '04')
split(data, ceiling(seq_along(data) / 3))
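
`split()` returns a list with one chunk per group. As a rough sketch (the names `width`, `grp`, `chunks` and `rows` are introduced here for illustration), the chunks can be stacked into rows by indexing each one up to the full width, which pads a short final chunk with NA:

```r
data <- c("10", "25", "09", "04", "14", "100", "01",
          "10", "100", "04", "04", "01", "04")
width <- 3

# Position-based group index: 1 1 1 2 2 2 ... for consecutive triples
grp <- ceiling(seq_along(data) / width)
chunks <- split(data, grp)

# Indexing past the end of a chunk yields NA, so every chunk has length `width`;
# sapply() then returns a width x n matrix, which t() turns into n rows
rows <- t(sapply(chunks, function(ch) ch[seq_len(width)]))
print(rows)
```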

If it is in a data frame this should do it:

library(dplyr)
library(tidyr)
data <- data.frame(
  value = c('10', '25', '09', '04', '14', '100', '01',
        '10', '100', '04', '04', '01', '04'))
data %>%
  mutate(key = rep_len(c('a', 'b', 'c'), length.out = nrow(.))) %>%
  group_by(idx = as.integer((row_number() - 1) / 3)) %>% 
  spread(key, value) %>%
  # ungroup first: select() keeps grouping columns, so idx could not be dropped
  ungroup() %>%
  select(-idx)
