3

I have have a data.frame that is 800x28000. I would like to spilt the column based on a pattern and bind the rows. However, I need to create a new column called "ticker" along the way.

Here is the data frame.

df1  = structure(list(AAPL.Price = c(20, 21), AAPL.Volume = c(100, 105
), MSFT.Price = c(30, 31), MSFT.Volume = c(150, 160)), class = "data.frame", row.names = c(NA, 
-2L))

print(df1)

  AAPL.Price AAPL.Volume MSFT.Price MSFT.Volume
1         20         100         30         150
2         21         105         31         160

I would like the final data frame to look like:

  Price Volume Ticker
1    20    100   AAPL
2    21    105   AAPL
3    30    150   MSFT
4    31    160   MSFT

I have built a not so good work around. The answer I am looking for is a slight variation to the answer that Akrun provided here. Split Data Every N Columns and rbind Using R The only difference being, I need to create an extra column "Ticker" as I am binding the rows.

Here is my current work around.

library(foreach)

library(tidyverse)

ticker = c("AAPL", "MSFT")

rbind.table = function(ticker){
  header = paste0("^", ticker)
  df2 = df1[str_detect(names(df1), header)]%>%mutate(ticker = ticker)
  names(df2) = c("Price", "Volume", "Ticker")
  return(df2)
}


df2 = foreach(r = ticker, .packages = c("tidyverse", "stringr"), .combine = "bind_rows") %dopar% {rbind.table(r)}

print(df2)

  Price Volume Ticker
1    20    100   AAPL
2    21    105   AAPL
3    30    150   MSFT
4    31    160   MSFT

Here is a sreen shot of the real data.frame. enter image description here

Jordan Wrong
  • 1,205
  • 1
  • 12
  • 32
  • I have 28k columns and it takes 11 seconds. I thought there might be a faster and "best practice way" using data.table. – Jordan Wrong Sep 10 '19 at 18:58

2 Answers2

2

One dplyr and tidyr possibility could be:

df1 %>%
 rowid_to_column() %>%
 gather(var, val, -rowid) %>%
 separate(var, c("Ticker", "unit")) %>%
 spread(unit, val) %>%
 select(-rowid)

  Ticker Price Volume
1   AAPL    20    100
2   AAPL    21    105
3   MSFT    30    150
4   MSFT    31    160
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • Hi @tmnfk, In my actual data set I have 25 variables for each ticker. So something like (Price, Volume, Implied Vol, Slope....). I am leaning more towards a data.table pattern recognition. – Jordan Wrong Sep 10 '19 at 19:08
  • If it has a structure as given in the example (i.e. `.Price`, `.Volume`), then is should work also with other variables. If your actual data is different, please post an appropriate example, otherwise it will difficult to help you :) – tmfmnk Sep 10 '19 at 19:10
  • thanks @tmfmnk, I added a screen shot of the actual data.frame. I tried your code, but I can not seem to get it to work. I think I have to specify a different parameter in the separate() function – Jordan Wrong Sep 10 '19 at 19:18
  • In the case of your actual data, what is the separator that separates the tickers and the units? For the example data, it was a `.`. – tmfmnk Sep 10 '19 at 19:24
  • It took me a bit to figure this out, but I got it to work! Thank you tmfmnk! Its much appreciated. However I am going to except the above answer solely because it is in data.table. Thanks again :) – Jordan Wrong Sep 10 '19 at 20:05
1

The column name pattern would need to be updated to match the screenshot you just added, but could probably still adapt this:

library(data.table)
df1 <- structure(
  list(
    AAPL.Price = c(20, 21),
    AAPL.Volume = c(100, 105), 
    MSFT.Price = c(30, 31), 
    MSFT.Volume = c(150, 160)
  ), class = "data.frame", row.names = c(NA,-2L)
)
dt1 <- as.data.table(df1)
tickers <- unique(sub(pattern = "\\..*$", replacement = "", x = colnames(dt1)))
vars <- unique(sub(pattern = ".*\\.(.*)$", replacement = "\\1", x = colnames(dt1)))

dt1_melted <- melt(dt1, measure = patterns(paste0(".", vars)), value.name = vars)[, Ticker := tickers[variable]][]
> dt1_melted
   variable Price Volume Ticker
1:        1    20    100   AAPL
2:        1    21    105   AAPL
3:        2    30    150   MSFT
4:        2    31    160   MSFT
clmarquart
  • 4,721
  • 1
  • 27
  • 23