
I'm implementing a class to store time-series (OHLCV) data, which will contain methods applied to the parsed file. I'm trying to figure out whether there is a faster way to load the content of each file (.csv files of ≈ 40000 rows) into a std::unordered_map<std::string, OHLCV>. The structure of the file is fixed (order of the header):

.
├── file.csv
│
└── columns:
    ├── std::string datetime
    ├── float open
    ├── float high
    ├── float low
    ├── float close
    └── float volume

The class is implemented as follows:

#include <ctime>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

class OHLCV {

private:

    const char* format = "%Y-%m-%d %H:%M:%S";
    std::vector<long int> timestamps;
    std::vector<float> opens, highs, lows, closes, volumes;

public:

    void upload(
        const std::string& filepath,
        const char& sep,
        const bool& header
    )
    {
        std::ifstream stream(filepath);
        if (stream) {
            std::string line, timestamp, open, high, low, close, volume;

            if (header) {
                // Skip the header row
                std::getline(stream, line);
            }

            while (std::getline(stream, line)) {

                std::stringstream ss(line);

                std::getline(ss, timestamp, sep);
                std::getline(ss, open, sep);
                std::getline(ss, high, sep);
                std::getline(ss, low, sep);
                std::getline(ss, close, sep);
                std::getline(ss, volume, sep);

                // Parse the datetime string into a Unix timestamp
                std::tm tm{};
                std::istringstream ts(timestamp);
                ts >> std::get_time(&tm, format);
                timestamps.emplace_back(std::mktime(&tm));

                opens.emplace_back(std::stof(open));
                highs.emplace_back(std::stof(high));
                lows.emplace_back(std::stof(low));
                closes.emplace_back(std::stof(close));
                volumes.emplace_back(std::stof(volume));

            }
        }
    }

};

I ran a few tests to see how OHLCV::upload was performing, and these are some of the recorded times:

[timer] ohlcv::upload ~ 338(ms), 338213700(ns)
[timer] ohlcv::upload ~ 329(ms), 329451900(ns)
[timer] ohlcv::upload ~ 345(ms), 345494100(ns)
[timer] ohlcv::upload ~ 328(ms), 328179800(ns)
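
The exact timer isn't included in this post; a minimal std::chrono-based scope timer that produces output in this shape (the Timer name and printf format here are illustrative) could look like:

#include <chrono>
#include <cstdio>

// Minimal RAII scope timer; prints elapsed time when it goes out of scope.
struct Timer {
    const char* label;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    explicit Timer(const char* l) : label(l) {}
    ~Timer() {
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("[timer] %s ~ %lld(ms), %lld(ns)\n", label,
                    static_cast<long long>(ns / 1000000), static_cast<long long>(ns));
    }
};

// Usage:
// {
//     Timer t("ohlcv::upload");
//     ohlcv.upload("file.csv", ',', true);
// }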

Knowing that my optimization setting is currently Maximum Optimization (Favor Speed) (/O2) and I'm testing in Release mode, could I improve the speed of the load without using a std::array with a const unsigned int MAX_LEN known at compile time?

Little note: Pandas (Python) takes ≈ 63ms to load one of these files.

BloomShell
  • Just a side note: "upload" means sending data to a distant server. You just mean *load* here. – Vincent Fourmond Aug 20 '22 at 16:43
  • A little bit off topic, but OK I guess. Maybe _read_ohlcv_csv_ or _read_csv_ would be even better. – BloomShell Aug 21 '22 at 15:21
  • Have you tried to do a profile on this code to find out where it's spending the most time? – Mark Ransom Aug 22 '22 at 23:13
  • Yes, I tried to optimize this as much as possible, but I could only make it as fast as Python (≈ 65ms). I'll post an update on this, but the main ideas to optimize it are: 1) increasing the buffer size with `input.rdbuf()->pubsetbuf(buffer, sizeof(buffer));`; 2) reserving capacity in the vectors to avoid redundant allocations during iteration, using `std::filesystem::file_size(filepath) / _LINE_LEN;`; 3) using `string::find` instead of stringstream to split the string. – BloomShell Aug 23 '22 at 05:31
  • I have also tried the _`C`_ approach (_`cstdio`_) with `fopen` and `fscanf` but it is slower (≈ 85ms) than the above strategy. – BloomShell Aug 23 '22 at 05:43
  • When you post your update, make it an answer rather than an edit to the question. And I meant "profile" in a very specific sense, to use a tool that tells you what amount of time is spent in each statement or function call. – Mark Ransom Aug 24 '22 at 15:47
  • Can you suggest a resource/tutorial that explains how to do this kind of profiling with a well-known tool? – BloomShell Aug 24 '22 at 16:29

1 Answer

  1. Increasing the buffer size to reduce the number of read operations. As referenced here, "With a user-provided buffer, reading from file reads largest multiples of 4096 that fit in the buffer"; following a test published in another answer, the optimal buffer size should be around 64KB. Also note that with my compiler it is fine to open the file first and then call pubsetbuf, but this is compiler-dependent. I found that my optimal size is char buffer[65536].
  2. Using string::find() instead of a stringstream to split the line, following the advice provided by @rustyx here.
  3. Taking advantage of knowing the structure of the incoming data (the columns are known, the number of rows is not). Therefore, while reading each line I do not check for std::string::npos but instead iterate over the expected fields in the line.
  4. Instead of using one vector per column, creating a struct which stores the record for each row. Then, initializing a std::vector<record> and reserving enough capacity to avoid multiple needless allocations. To approximate the number of rows I use std::filesystem::file_size. Note that I am not copying the data into the vector but constructing the struct directly inside the vector with records.emplace_back(std::forward<Args>(args)...).
  5. Optimizing type conversion. In my case there are two conversions: 1) string to timestamp, where I found that Howard Hinnant's date.h library brings an incredible speed increase; 2) string to float, where I do not need the substring to be allocated, so I simply create a std::string_view and parse it with std::from_chars, which increases speed quite a lot.
  6. Using a static-size array. Despite my initial need for dynamism, it is no secret that some kind of static-size array gives the best optimization gain. In my case I am using std::shared_ptr<std::array<record, rows>> records;. A minimal sketch putting points 1-5 together is shown after this list.
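
Here is that sketch: a 64KB pubsetbuf buffer, capacity reserved from std::filesystem::file_size, splitting with find() over a std::string_view, a per-row record struct built in place with emplace_back, and std::from_chars for the float fields. The record layout, the LINE_LEN estimate and the read_ohlcv_csv name are illustrative rather than the exact code I benchmarked, and the timestamp conversion uses std::get_time here instead of date.h to keep the example self-contained.

#include <array>
#include <charconv>
#include <ctime>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

struct record {
    long int timestamp;
    float open, high, low, close, volume;
    record(long int t, float o, float h, float l, float c, float v)
        : timestamp(t), open(o), high(h), low(l), close(c), volume(v) {}
};

// Rough average bytes per row, only used to size the reserve() call (point 4).
constexpr std::size_t LINE_LEN = 60;

// Parse a float field without allocating a substring (point 5).
inline float to_float(std::string_view sv) {
    float value = 0.0f;
    std::from_chars(sv.data(), sv.data() + sv.size(), value);
    return value;
}

std::vector<record> read_ohlcv_csv(const std::string& filepath, char sep, bool header) {
    char buffer[65536];                                  // 64KB stream buffer (point 1)
    std::ifstream stream;
    stream.rdbuf()->pubsetbuf(buffer, sizeof(buffer));   // set before open() for portability
    stream.open(filepath);

    std::vector<record> records;
    records.reserve(std::filesystem::file_size(filepath) / LINE_LEN);

    std::string line;
    if (header) std::getline(stream, line);              // skip the header row

    while (std::getline(stream, line)) {
        std::string_view row(line);
        std::array<std::string_view, 6> field;

        // Split on the separator with find(); exactly six fields are expected,
        // so there is no npos check inside the loop (points 2 and 3).
        std::size_t start = 0;
        for (std::size_t i = 0; i < 5; ++i) {
            std::size_t end = row.find(sep, start);
            field[i] = row.substr(start, end - start);
            start = end + 1;
        }
        field[5] = row.substr(start);

        // Timestamp conversion kept simple here; date.h is the faster option (point 5).
        std::tm tm{};
        std::istringstream ts{std::string(field[0])};
        ts >> std::get_time(&tm, "%Y-%m-%d %H:%M:%S");

        records.emplace_back(std::mktime(&tm),
                             to_float(field[1]), to_float(field[2]),
                             to_float(field[3]), to_float(field[4]),
                             to_float(field[5]));
    }
    return records;
}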
BloomShell