This is a follow-up to this SO question:

NumPy "record array" or "structured array" or "recarray"

I can't work out which of the three is best for my situation.

My data has two columns: one holds a single int, and the other holds a variable-length batch of ints (2 to 150 per row).

The code below downloads a small sample of the data (10 MB) and opens it in pandas:

import requests
import pickle
import numpy as np
import pandas as pd

def download_file_from_google_drive(file_id, destination):
    """Download a shared Google Drive file to the local path `destination`."""
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    # Large files trigger a warning page; repeat the request with the confirm token.
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')

sampleDF = pd.read_pickle('sample.pkl')

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

Here is a Colab notebook, so you don't have to download anything onto your system:

https://colab.research.google.com/drive/1kaaYk5_xbzQcXTr_DhjuWQT_3S4E-rML

SantoshGupta7
  • For variable-length lists of ints? NumPy itself does not seem ideal. What exactly is your use case? – juanpa.arrivillaga Aug 27 '19 at 18:43
  • The linked SO should make it clear that the 3 structures store the data in the same way. They just differ in some constructors and field access methods. You could define a dtype with an `int` field and an `object` field (to hold the lists or arrays). `sampleDF.to_records()` would do something similar. But I'm not sure this does anything special for you. – hpaulj Aug 27 '19 at 19:23
  • @juanpa.arrivillaga This is for machine learning training. The first int represents a label, the others represent inputs. I heard that NumPy records are the best format for this, but if not, is there a better format for fitting as much of this data into as little RAM as possible? – SantoshGupta7 Aug 27 '19 at 21:59
  • It's too early to worry about RAM use. Converting between different storage formats is relatively easy. For development/testing purposes you want to work with a modest-size dataset, not one that stretches your memory. Initially focus on what data is useful. While I'm not an expert in machine learning, my impression is that 'variable length batches' is not a good input for that. Fixed-length 'inputs' are optimal for memory use; a (n, 150) 2d array will be better, for both memory use and calculations, than a (n,) array of variable-length lists/arrays. – hpaulj Aug 27 '19 at 23:26
  • Another way to put that last comment - make sure you understand the machine learning code and its data expectations. That's the starting point. – hpaulj Aug 27 '19 at 23:32
  • I know, but that is just how the data is. Perhaps padding could be an option, though given the scale of variation, I doubt it would end up taking less memory. – SantoshGupta7 Aug 28 '19 at 00:14
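
The two layouts raised in the comments above can be sketched side by side. This is only a rough illustration on made-up toy data, not the linked pickle: the field names `label` and `inputs`, the helper `pad_rows`, the `int64` dtype, and the padding value `0` are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical toy data: one int label per row plus a variable-length batch of ints.
labels = [7, 3, 9]
batches = [[1, 2], [4, 5, 6, 7], [8]]

# Layout 1: structured array with an int field and an object field.
# The object field only stores references to separate per-row arrays.
dt = np.dtype([("label", np.int64), ("inputs", object)])
records = np.empty(len(labels), dtype=dt)
records["label"] = labels
for i, batch in enumerate(batches):
    records["inputs"][i] = np.asarray(batch, dtype=np.int64)

# Layout 2: fixed-width 2D array, padded on the right (fill value is an assumption).
def pad_rows(rows, width, fill=0):
    """Copy each row into a (len(rows), width) array, padding with `fill`."""
    out = np.full((len(rows), width), fill, dtype=np.int64)
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out

lengths = np.array([len(b) for b in batches])  # needed later to mask out padding
padded = pad_rows(batches, int(lengths.max()))

# Raw element storage only; per-array Python object overhead is not counted here.
ragged_payload = sum(a.nbytes for a in records["inputs"])
padded_payload = padded.nbytes
print(ragged_payload, padded_payload)
```

Note that `records.nbytes` counts only the labels and the object pointers, not the per-row arrays those pointers refer to, and every small array also carries its own fixed object overhead. Whether padding ends up smaller therefore depends on how the row lengths are distributed. If the values fit, a narrower dtype such as `np.int32` halves the element storage in either layout.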

0 Answers