7

I have a large data set (35.8 GB, over 1 billion rows) that I need to tidy.

Here is some reproducible code to show its format.

library(data.table)

id <- c(1, 1, 2, 2)
item <- c("a1", "a2", "a1", "a2")
itemformat <- c("AA", "AB", "BB", "AB")
full <- cbind(id, item, itemformat)
full <- as.data.table(full)
full

This code works for this example:

library(dplyr)
library(tidyr)

full2 <- full %>% 
  pivot_wider(names_from = item, values_from = itemformat)
full2

However, when I use this method on the massive dataset, R either reports that it cannot allocate a vector of size 3.8 GB (even after restarting R and running gc()) or simply crashes (after working for over 30 minutes).

I have noticed that fread() and fwrite() from the data.table package are much faster than anything else I've used (e.g., read.csv() and read_csv()).

So, is there a way (per the data.table package, perhaps?) to rearrange this data into having roughly 800,000 columns?

Also, is 800,000 columns above some sort of limit?

After tidying this, the file size should go down (hopefully).

Side note: I have 64 GB of RAM.

Thank you!

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Scott Hebert
  • 133
  • 9
  • 4
    Not sure if it will work, but `dcast` is the *data.table* equivalent, as shown here at one of the canonical questions for reshaping: https://stackoverflow.com/a/36253517/496803 , with the full documentation here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html – thelatemail Oct 20 '22 at 00:40
  • 2
    Rent cloud based hardware for the few minutes it's required. You have exceeded desktop PC sensibilities. Don't fight the future. – Brad Oct 20 '22 at 00:51
  • 2
    Curious: Why do you need to reshape from long to wide? Usually long makes data processing/tidying much easier; and 800,000 columns sounds quite unwieldy. That said, reading a 35.8 GB monolithic file is challenging in itself (nevermind the format). Is there an option to read from a database instead? – Maurits Evers Oct 20 '22 at 01:23
  • @thelatemail Thank you! I'll look into that. I tried looking up the functions, but my quick search turned up dry. Thank you for the links! – Scott Hebert Oct 20 '22 at 01:32
  • @Brad Thank you for the response! I just might look into that lol. – Scott Hebert Oct 20 '22 at 01:34
  • @MauritsEvers Thank you for the response! The data is genetic data, and I am trying to tidy it so that one row is for one person and each column is a single nucleotide polymorphism (SNP). I want to be able to treat each SNP as a variable so I can do some logistic regression-type stuff. I hope that helps explain my thought process. – Scott Hebert Oct 20 '22 at 01:36
  • 1
    Do you actually need all 800,000 SNPs? If you’re only interested in a subset, you could filter before pivoting. Otherwise I agree you may need a cloud-based solution. – zephryl Oct 20 '22 at 01:59
  • 1
    @ScottHebert So you are planning to do a regression model with 800,000 SNPs as predictors? That sounds like a challenge from a modelling point of view (sparsity, collinearities, etc.). I agree with zephryl, is there a way to reduce the number of SNPs? I did some work on this many years ago and remember a lot of discussions on ways to identify functional & relevant SNPs. Either way, I don't think there is a simple plug-n-play solution here. – Maurits Evers Oct 20 '22 at 02:11
  • @zephryl I plan to do some dimension reduction to only use the most relevant SNPs (based on whatever method of dimension reduction I use), but I'd like to get the data into this format first. If I could do this with participants each taking up multiple rows, I would, but I don't know of a way to make that work. Thank you for your response, by the way! – Scott Hebert Oct 20 '22 at 02:17
  • @MauritsEvers I plan to do dimension reduction (LASSO or something similar) to only use the best-predicting SNPs, though I'd like to tidy the data first. If I could do dimension reduction with participants each covering hundreds of thousands of rows, I would, but I don't know of a way to make that work. – Scott Hebert Oct 20 '22 at 02:18

3 Answers

7

I generally agree with the comments that such a large dataset will be hard to work with locally. That said, you could try pivoting in chunks, saving each chunk to disk and then re-loading and binding all the pivoted chunks.

set.seed(13)
library(tidyr)
library(dplyr)
library(purrr)

# example data: 9 cases with 3 SNPs each
(data_long <- expand_grid(id = 1:9, item = paste0("a", 1:3)) |>
  mutate(itemformat = sample(c("AA", "AB", "BB"), 27, replace = TRUE)))
#> # A tibble: 27 × 3
#>       id item  itemformat
#>    <int> <chr> <chr>     
#>  1     1 a1    BB        
#>  2     1 a2    AA        
#>  3     1 a3    AB        
#>  4     2 a1    AA        
#>  5     2 a2    AB        
#>  6     2 a3    AB        
#>  7     3 a1    AA        
#>  8     3 a2    BB        
#>  9     3 a3    AA        
#> 10     4 a1    AB        
#> # … with 17 more rows

# define chunks to process 3 cases at a time
ids <- unique(data_long$id)
cases_per_chunk <- 3
chunks <- tibble(
    start = seq(1, length(ids), by = cases_per_chunk),
    stop = start + cases_per_chunk - 1
  ) |>
  mutate(filename = paste0("SNPs_wide", row_number(), ".rds"))

# pivot and save in chunks of 3 cases
pwalk(chunks, \(start, stop, filename) {
  data_long |>
    filter(id %in% ids[start:stop]) |>
    pivot_wider(names_from = item, values_from = itemformat) |>
    saveRDS(filename)
})

# load pivoted chunks and row-bind
(data_wide <- map_dfr(chunks$filename, readRDS))
#> # A tibble: 9 × 4
#>      id a1    a2    a3   
#>   <int> <chr> <chr> <chr>
#> 1     1 BB    AA    AB   
#> 2     2 AA    AB    AB   
#> 3     3 AA    BB    AA   
#> 4     4 AB    BB    BB   
#> 5     5 BB    AA    AA   
#> 6     6 AB    AA    BB   
#> 7     7 AA    AA    BB   
#> 8     8 BB    AB    AA   
#> 9     9 AA    BB    BB

Created on 2022-10-20 with reprex v2.0.2

A few other thoughts:

  • You could also tinker with reading your initial dataset in chunks (say, 800K rows at a time), then pivoting and saving each chunk as above (see the sketch after this list).
  • Tools optimized for large datasets (e.g., arrow) might be worth a look.
  • Where did these data come from? Is it possible, and perhaps easier, to do something upstream to get the data into a wide format?
  • I don't know much about Bioconductor, but it's a project focused on tools for bioinformatics in R, so I would poke around there if you haven't already.
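
In case it helps, here is a rough sketch of the chunked-read idea from the first bullet, combining data.table::fread() with dcast() (the data.table reshaping function mentioned in the comments). It assumes the file is sorted by id and that the chunk size is a multiple of the number of SNPs per participant, so no participant is split across two chunks; the file name, row counts, and chunk size are placeholders you would adjust for your actual data:

library(data.table)

rows_per_person  <- 800000L                 # SNPs per participant (assumed)
people_per_chunk <- 100L                    # tune to your available RAM
chunk_rows <- rows_per_person * people_per_chunk
n_chunks   <- ceiling(1e9 / chunk_rows)     # ~1 billion rows in total (assumed)

for (i in seq_len(n_chunks)) {
  chunk <- fread(
    "full.csv",
    skip      = 1 + (i - 1) * chunk_rows,   # skip the header plus earlier chunks
    nrows     = chunk_rows,
    header    = FALSE,
    col.names = c("id", "item", "itemformat")
  )
  # pivot this chunk wide and save it to disk, as in the code above
  wide <- dcast(chunk, id ~ item, value.var = "itemformat")
  saveRDS(wide, paste0("SNPs_wide_chunk", i, ".rds"))
}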
zephryl
  • 14,633
  • 3
  • 11
  • 30
  • 2
    Thank you for your thorough comment! I didn't get notified of this, so I've already been able to tidy the data using dcast(), but I imagine that your answer will help others out in the future. It also may help me if I face a problem similar to this again. Also, since the data is wider and thus tidy, it has now gone to under 3 GB, which is much more manageable. Anyway, thanks again! – Scott Hebert Oct 27 '22 at 15:56
4

A billion rows is pushing the limits for R on a personal computer in my experience; if you're open to other alternatives, you could process the data out-of-memory using AWK:

cat full.csv
"id","item","itemformat"
"1","a1","AA"
"1","a2","AB"
"2","a1","BB"
"2","a2","AB"

awk 'BEGIN{ FS=OFS="," }
# first pass over the file: collect the distinct item names (skip the header)
NR==FNR {
    if (NR > 1 ) {
        items[$2]
    }
    next
}
# second pass, header line: print "id" followed by one column per item
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
# the id has changed: flush the previous row, then remember the new id
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
# remember this value for the current item
{
  values[$2] = $3
}
# flush the final row
END { prt() }

# print the saved id followed by each item value, then reset for the next id
function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv

cat full2.csv
"id","a1","a2"
"1","AA","AB"
"2","BB","AB"

Edit

I think @zephryl's answer is the best solution to your problem (+1), but it might be worth doing some benchmarking to make sure it doesn't require more resources than you have available, or an unreasonable amount of time. For my AWK approach, with an example dataset of a billion rows (~14 GB), the pivot uses at most ~500 MB of RAM and takes about 45 minutes on my laptop (MacBook Pro 2017, 2.3 GHz dual-core i5, 8 GB RAM):

# Create a large example dataset (~1 billion rows)
awk 'BEGIN{print "id,item,itemformat"; for (i=800000; i<1000000000; i++) {printf "%d,%s%d,%s\n", i/800000, "a", i%800000+1, (rand()>0.5?"AA":"AB")}}' > full.csv

head full.csv
id,item,itemformat
1,a1,AA
1,a2,AA
1,a3,AB
1,a4,AA
1,a5,AA
1,a6,AA
1,a7,AB
1,a8,AB
1,a9,AA

## "Items" printed in any order
awk 'BEGIN{ FS=OFS="," }
NR==FNR {
    if (NR > 1 ) {
        items[$2]
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
  values[$2] = $3
}
END { prt() }

function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv


###############################

# "Items" printed in the 'original' order
awk '
BEGIN{ FS=OFS="," }
NR==FNR {
    PROCINFO["sorted_in"] = "@val_num_asc"
    if (NR > 1 ) {
        items[$2]=NR
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
  values[$2] = $3
}
END { prt() }

function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' test.csv test.csv > output.csv
jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
  • 1
    Thank you for your comment! I wasn't notified of these comments, so I apologize for my delayed response. AWK is great -- I've used it a couple times with this data! Fortunately, though, I was able to tidy the data with dcast(). However, your answer is great and may prove to be useful to others (and perhaps me) in the future! Thanks again! – Scott Hebert Oct 27 '22 at 15:58
  • 1
    Glad you solved your problem @ScottHebert! Please take the time to 'answer your own question' (https://stackoverflow.com/help/self-answer) detailing the `dcast()` approach, as I suspect others will stumble across this question and want to know how you handled it. – jared_mamrot Oct 27 '22 at 23:03
1

Update: I was able to do this with the following format:

full2 <- full %>% 
  dcast(id ~ item, value.var = "itemformat")
full2

I deleted my actual code (a dumb idea), but I believe that is the format I used. It worked on the sample data set above, was pretty quick, and handled two different large data sets.
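
For reference, a minimal self-contained sketch of that approach applied to the sample data from the question (the data.table is built directly here rather than via cbind(), purely for illustration):

library(data.table)

# sample data from the question
full <- data.table(
  id         = c(1, 1, 2, 2),
  item       = c("a1", "a2", "a1", "a2"),
  itemformat = c("AA", "AB", "BB", "AB")
)

# long -> wide: one row per id, one column per item
full2 <- dcast(full, id ~ item, value.var = "itemformat")
full2
# result: id 1 -> a1 = AA, a2 = AB; id 2 -> a1 = BB, a2 = AB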

Scott Hebert
  • 133
  • 9