6

I have a ~200 GB dataset of approximately 1.5 billion observations, on which I need to run some conditional analysis and data aggregation*.

The thing is that I'm not used to (nor trained to handle) large datasets. I usually work in R or Python (with some Julia on the side), and I am completely lost as soon as I can't fit the dataset into memory.

How do people handle these datasets that fit on disk but not in memory? Where should I start looking for solutions? Is there a place where information on large, yet not quite "big data", datasets is centralized?

*Long story short: I have another dataset (which fits into memory), and for each row of this small dataset I want to count the number of observations in the large dataset that match some conditions from the small dataset. My initial reaction was to run the code in chunks, but that is very inefficient and would take centuries of single-processor computing time.

Since it has been specifically asked, I will describe the structure of my file.

I have a big file, let us call it BIG, with (notably) two ID variables, $ID0$ and $ID1$, and a date variable $date1$.

I have a small file, let us call it SMALL, with two ID variables, $ID2$ and $ID3$, and a date variable $date2$.

For each $ID2_i$, I want to count all observations such that $\{ID0 = ID2_i,\ date1 < date2_i,\ ID1 = ID2_j \mid j : ID3_j = ID3_i \wedge date2_j < date2_i\}$.
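To make the condition concrete, here is roughly what the naive count for a single row of SMALL looks like in pandas (column names follow the notation above); looping this over all 1.6 million rows of SMALL is exactly the approach that is far too slow:

import pandas as pd

def count_for_row(BIG, SMALL, i):
    row = SMALL.iloc[i]
    # ID2 values of SMALL rows that share ID3 with row i and have an earlier date
    id_set = SMALL.loc[(SMALL["ID3"] == row["ID3"]) &
                       (SMALL["date2"] < row["date2"]), "ID2"]
    # observations in BIG matching ID0, with an earlier date, and ID1 in that set
    match = ((BIG["ID0"] == row["ID2"]) &
             (BIG["date1"] < row["date2"]) &
             (BIG["ID1"].isin(id_set)))
    return match.sum()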

  • Alright, I can get access to a memory-optimized server; I shall speak with the scientific computing department then. Concerning random sampling: the issue is that the statistic of interest is, as you say, the micro-level data, which I will then use in an estimation process. I'm not comfortable with bootstrapping my way out of this when the data is there and accessible. –  Feb 06 '20 at 16:31
  • You need to provide the specific structure of your file and what information you want to extract from it. Just think about what information you really need and don't just load the raw file into your RAM. For example, you might not need the exact value but could directly fill that value into a histogram. I bet your question could be solved with only a few MB of RAM! – KaPy3141 Feb 05 '20 at 15:56
  • Get more RAM, use random sampling, process the data in pieces, use an approximation... – Sycorax Feb 05 '20 at 16:01
  • Yes, I could use 1 MB of RAM to solve the problem. The main issue is that I would then need to run the conditional count once per row of my small dataset, i.e. 1.6 million times, which takes very long when each count is over more than 1.5 billion observations. I can't approximate either, since I need the exact count. I'm not asking for an answer to my problem, since it is long and uninteresting, but mostly for online resources or a direction towards good practices. –  Feb 05 '20 at 16:05
  • How would you run it if it were "big data"? – Dave Feb 05 '20 at 16:10
  • Data are not important to statisticians without an application in mind. Regressions, plots, etc. all have convenient ways of chunking, which *could* be an on-topic question. All one can say in general is get a SAS license. – AdamO Feb 05 '20 at 19:59
  • what format is your file? text/csv? xml? bytes? – julian bechtold Feb 05 '20 at 20:39
  • Perhaps the more useful question is "How can I use statistical principles such as random sampling to estimate my statistic of interest?" – Sycorax Feb 05 '20 at 22:56
  • Depending on how long it would take to run the analysis and whether your tool of choice can handle a file like that, you might try a memory-optimized cloud computing node. It's possible that something with, say, 700 GB of RAM might be enough to run the analysis... – Mike Burr Feb 06 '20 at 01:55
  • My approach to running queries on very large (compressed) csv files: https://stackoverflow.com/a/68693819/8079808 – San Aug 09 '21 at 10:17

3 Answers

3

I might be misunderstanding your problem, but chunking the big file (as suggested in comments already) seems to me the most straightforward approach.

Say you divide the 200 GB file into 100 chunks; you then iterate over the chunks, do the desired counting for each chunk, and aggregate the results. If the per-chunk operation runs in minutes, you should be fine, unless you want to do this over and over again.

For more concrete suggestions, I'd need to know a bit more about the data storage format. Are we talking about a big .csv file? In that case, for R, you might look into the chunked API of the readr package. For doing the counting as quickly as possible, again in R, the data.table package might come in handy.

Edit: Adding some example code

This won't do exactly what you requested, but it hopefully covers the key points needed to make a solution along the lines I suggested work.

library(data.table)
library(readr)

# simulate some data in place of the real BIG and SMALL files
ids <- seq.int(1, 1e2)
dates <- seq(as.Date("1999/01/01"), as.Date("2000/01/01"), by = "day")

big <- data.table(id0 = sample(ids, 1e6, replace = TRUE),
                  id1 = sample(ids, 1e6, replace = TRUE),
                  date1 = sample(dates, 1e6, replace = TRUE))

write.csv(big, "big.csv", row.names = FALSE)

small <- data.table(id2 = sample(ids, 1e2),
                    id3 = sample(ids, 1e2),
                    date2 = sample(dates, 1e2))

# callback applied to each chunk: join the chunk against small (with a
# non-equi condition on the dates), count the matching rows of small for
# each row of the chunk, and add the counts to the running accumulator acc
count_fun <- function(x, pos, acc) {
  setDT(x)
  tmp <- small[x, list(counts = .N),
               on = c("id2 == id0", "id3 == id1", "date2 > date1"),
               by = .EACHI, nomatch = NULL]
  acc[tmp$id2] <- acc[tmp$id2] + tmp$counts
  acc
}

# the accumulator starts as a vector of zeros, one entry per ID
accumulator <- AccumulateCallback$new(count_fun, acc = rep(0, length(ids)))

# stream big.csv in chunks of 1e4 rows, applying count_fun to each chunk
counts <- read_csv_chunked("big.csv", accumulator, chunk_size = 1e4)
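Once the last chunk has been processed, read_csv_chunked should hand back the final accumulator, so counts[i] ends up holding the total accumulated for ID value i over the whole file.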
nbenn
  • Yeah, it's CSV, but I can get it in another format; that is not an issue. –  Feb 06 '20 at 16:24
  • I updated my answer with some example code in R. Feel free to ask if you have any questions. – nbenn Feb 06 '20 at 17:43
3

There are different methods:

Chunk up the dataset (saves time in the future but requires an initial time investment)

Chunking makes many operations, such as shuffling, much easier.

Make sure each subset/chunk is representative of the whole dataset. Each chunk file should have the same number of lines.

This can be done by appending lines to the chunk files one after another. You will quickly realize that it is inefficient to open a file just to write a single line, especially while reading and writing on the same drive, so add read and write buffers that fit into memory; a sketch of this idea follows below.

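A minimal Python sketch of this splitting step, assuming a single big.csv with a header row (the file names, chunk count and buffer size are placeholders I chose, not anything from the original setup):

import os

N_CHUNKS = 100            # number of chunk files to create
BUFFER_LINES = 100_000    # lines buffered per chunk before flushing to disk

os.makedirs("chunks", exist_ok=True)
buffers = [[] for _ in range(N_CHUNKS)]

def flush(idx):
    # append the buffered lines of chunk idx to its file in a single write
    with open(f"chunks/big_{idx:03d}.csv", "a", encoding="utf-8") as out:
        out.writelines(buffers[idx])
    buffers[idx].clear()

with open("big.csv", "r", encoding="utf-8") as f:
    f.readline()                      # skip the header line
    for i, line in enumerate(f):
        idx = i % N_CHUNKS            # round-robin keeps each chunk representative
        buffers[idx].append(line)
        if len(buffers[idx]) >= BUFFER_LINES:
            flush(idx)

for idx in range(N_CHUNKS):           # write out whatever is still buffered
    if buffers[idx]:
        flush(idx)

Adjust BUFFER_LINES so that the buffers comfortably fit into your RAM.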

Choose a chunk size that fits your needs; I chose a size that my default text editor can still open fairly quickly.

Smaller chunks can boost performance, especially if you want metrics such as the class distribution, because you only have to loop through one representative file to get an estimate for the whole dataset, which might be enough.
Bigger chunk files give a better representation of the whole dataset in each file, but you could just as well go through several smaller chunk files.

I use C# for this because I am far more experienced there and can therefore use the full feature set, such as splitting the reading / processing / writing tasks onto different threads.

If you are experienced with Python or R, I suspect similar functionality exists there as well. Parallelizing can be a huge factor on such large datasets.

Chunked datasets can be combined into one interleaved dataset, which you can process with tensor processing units. That would probably yield some of the best performance and can be executed locally as well as in the cloud on the really big machines, but it requires a lot of learning about TensorFlow; a rough sketch of the interleaving step is shown below.
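As a rough, non-authoritative sketch of that interleaving step with the tf.data API (the chunk file pattern matches the splitting sketch above and is an assumption):

import tensorflow as tf

# chunk files produced by the splitting step (the path pattern is an assumption)
files = tf.data.Dataset.list_files("chunks/big_*.csv", shuffle=True)

# read several chunk files in parallel and interleave their lines into one stream
dataset = files.interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)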

Use a reader and read the file step by step

Instead of doing something like all_of_it = file.read(), you want to use some kind of stream reader. The following function reads through one of the chunk files (or your whole 300 GB dataset) line by line to count each class within the file. By processing one line at a time, your program will not overflow the memory.

You might want to add some progress indication, such as X lines/s or X MB/s, in order to estimate the total processing time.

def getClassDistribution(path):
    classes = dict()
    # open one chunk file (or the full dataset) and count class labels line by line
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line != "":
                # the class label is assumed to be the single digit in the last column
                label = int(line[-1:])
                classes[label] = classes.get(label, 0) + 1
    return classes
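For example (the chunk file name is just a placeholder):

distribution = getClassDistribution("chunks/big_000.csv")
print(distribution)  # a dict mapping each class label to its count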


I use a combination of chunked datasets and estimation.

Pitfalls for performance

  • Whenever possible, avoid nested loops. Each loop inside another loop multiplies the complexity by n.
  • Whenever possible, process the data in one go. Each additional pass over the data adds another O(n).
  • If your data comes in CSV format, avoid premade functions such as cells = int(line.Split(',')[8]); this very quickly leads to a memory throughput bottleneck. A proper example of avoiding it can be found in getClassDistribution, where I only want to get the label.

The following C# function splits a CSV line into its elements very quickly.

// Call function
ThreadPool.QueueUserWorkItem((c) => AnalyzeLine("05.02.2020,12.20,10.13").Wait());

// Parallelize this on multiple cores/threads for ultimate performance
private async Task AnalyzeLine(string line)
{
    PriceElement elementToAdd = new PriceElement();
    int counter = 0;   // index of the field currently being read
    string temp = "";  // accumulates the characters of the current field
    foreach (char c in line)
    {
        if (c == ',')
        {
            switch (counter)
            {
                case 0:
                    elementToAdd.spotTime = DateTime.Parse(temp, CultureInfo.InvariantCulture);
                    break;
                case 1:
                    elementToAdd.buyPrice = decimal.Parse(temp, CultureInfo.InvariantCulture);
                    break;
            }
            temp = "";
            counter++;
        }
        else temp += c;
    }
    // the last field has no trailing comma, so parse it after the loop
    if (counter == 2)
        elementToAdd.sellPrice = decimal.Parse(temp, CultureInfo.InvariantCulture);

    // compare the price element to conditions on another thread
    Observate(elementToAdd);
}

Create a database and load the data

When processing CSV-like data, you can load the data into a database.
Databases are made to accommodate huge amounts of data, and you can expect very high performance.
However, a database will likely use up much more space on your disk than the raw data, which is one reason I moved away from using one. A rough sketch of this route is shown below.
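For example, a minimal sketch in Python with SQLite, assuming big.csv contains exactly the columns id0, id1 and date1 with ISO-formatted dates (table name, index and the example query values are placeholders): load the CSV once, add an index, and let the database do the conditional counting.

import csv
import sqlite3

con = sqlite3.connect("big.db")
con.execute("CREATE TABLE IF NOT EXISTS big (id0 INTEGER, id1 INTEGER, date1 TEXT)")

# bulk-load the CSV; the csv reader is consumed lazily, so memory use stays flat
with open("big.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    con.executemany("INSERT INTO big VALUES (?, ?, ?)", reader)
con.commit()

# an index turns the repeated conditional counts into cheap lookups
con.execute("CREATE INDEX IF NOT EXISTS idx_big ON big (id0, id1, date1)")
con.commit()

# one example conditional count (42 and the ISO date are placeholder values)
(count,) = con.execute(
    "SELECT COUNT(*) FROM big WHERE id0 = ? AND date1 < ?", (42, "2019-12-31")
).fetchone()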

Hardware Optimisations

If your code is well optimized, your bottleneck will most likely be the hard drive throughput.

  • If the data fits onto your local hard drive, use it locally, as this gets rid of network latencies (imagine 2-5 ms for each record in a local network and 10-100 ms for remote locations).
  • Use a modern hard drive. A 1 TB NVMe SSD costs around 130 today (e.g. the Intel 600p 1 TB). An NVMe SSD uses PCIe and is around 5 times faster than a normal SSD and 50 times faster than a normal hard drive, especially when quickly writing to different locations (as when chunking up data). SSDs have caught up vastly in capacity in recent years, and for a task like this they are a perfect fit.

(Screenshots omitted: they showed a performance comparison of TensorFlow training with the same data on the same machine, once stored locally on a standard SSD and once on a network-attached storage in the local network backed by a normal hard disk.)

julian bechtold
1

Looks like an $O(n^2)$ problem: each element in BIG has to be compared with all the others in BIG.

Maybe you can fit all the fields required for the comparison in memory (leaving the rest in the file). For example: 1.5 billion observations × (one 4-byte date + two 4-byte IDs) = 1.5e9 × 12 bytes, which fits in 18 GB; a rough sketch is given below.
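As an illustrative sketch (the column names, dtypes and chunk size are assumptions), this is how you might pull only those three fields into compact in-memory arrays with pandas:

import pandas as pd

parts = []
# stream the CSV in chunks, keeping only the three needed columns in compact types
for chunk in pd.read_csv("big.csv", usecols=["ID0", "ID1", "date1"],
                         chunksize=10_000_000):
    parts.append(pd.DataFrame({
        "ID0": chunk["ID0"].astype("int32"),
        "ID1": chunk["ID1"].astype("int32"),
        # store the date as days since epoch: 4 bytes instead of a string
        "date1": (pd.to_datetime(chunk["date1"]) -
                  pd.Timestamp("1970-01-01")).dt.days.astype("int32"),
    }))
keep = pd.concat(parts, ignore_index=True)
# 1.5e9 rows x 12 bytes/row is roughly the 18 GB estimated above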

Maybe you can sort BIG by date, and then your problem becomes $O(n \log n)$.

Maybe you can split BIG into chunks where $ID3_i = ID3_j$.

There are lots of possibilities.

  • For many such joins, it's possible to reduce the $O(n^2)$ performance to $O(n\log(n))$ by means of a suitable index. This is one of the services a robust database platform provides. – whuber Feb 06 '20 at 16:51