
I have a large dataset of 40 million records, ~21.0 GB in total, stored in MongoDB. It took me a few hours to load it into a pandas DataFrame, and total memory usage rose to about 28.7 GB (it was ~600 MB before loading).

import pandas as pd

cursor = mongocollection.find()   # mongocollection is an existing pymongo collection
data = pd.DataFrame()
count = 0
dataset = []
for i in cursor:
    dataset.append(i)
    del i
    count += 1
    if count % 100000 == 0:
        print(count)
        # build a DataFrame from the buffered documents and append it
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        dataset = []
        data = data.append(temp)

# flush the remaining documents that didn't fill a whole chunk
temp = pd.DataFrame(dataset, columns=dataset[0].keys())
dataset = []
data = data.append(temp)

Concerned about the time it takes to load such a dataset, I saved it to local disk with data.to_csv('localdisk.csv'). The CSV file is 7.1 GB.

So the question is: why is the CSV file so small while the memory used by the DataFrame (or something else?) is about four times larger, and is there a better way to reduce the DataFrame's memory usage? I have another dataset with more than 100 million similar items, and I wonder whether I could load it into memory with this approach.

PS: I think the reason it takes so long to load the data into memory is these three commands:

temp = pd.DataFrame(dataset, columns=dataset[0].keys())
dataset = []
data = data.append(temp)

There are 60,000 items in dataset, and it took ~5-10 minutes to load them into data (the pandas DataFrame).

> data.memory_usage(index=True).sum()
6451973127  # about 6 GB, close to the size of the CSV file
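One way to see where the memory actually goes is to measure per-column usage with deep=True, which counts the real size of the Python string objects held in object-dtype columns; the plain call above only counts an 8-byte pointer per cell, so it can understate string-heavy frames considerably. A minimal sketch, assuming the DataFrame is still named data:

import pandas as pd

# deep=True accounts for the actual Python string objects stored in
# object-dtype columns; without it each object cell is counted as 8 bytes.
per_column = data.memory_usage(index=True, deep=True)
print(per_column)
print("total (deep): {:.2f} GiB".format(per_column.sum() / 1024**3))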

UPDATE:

Code for generating metrics:

This SO answer says concat is more efficient than append. I haven't tested that yet.

import time
import psutil

last_time = time.time()
for i in cursor:
    dataset.append(i)
    del i
    count += 1
    if count % 100000 == 0:
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        dataset = []
        data = pd.concat([data, temp])
        current_time = time.time()
        cost_time = current_time - last_time
        last_time = current_time
        memory_usage = psutil.virtual_memory().used / (1024**3)
        print("count is {}, cost time is {}, memory usage is {}".format(count, cost_time, memory_usage))

Metrics of loading data into the DataFrame:

count is 100000, cost time is 12.346338033676147, memory usage is 0.7630538940429688
count is 200000, cost time is 8.272525310516357, memory usage is 0.806121826171875
count is 300000, cost time is 10.19885516166687, memory usage is 0.9408340454101562
count is 400000, cost time is 6.370742082595825, memory usage is 0.9675140380859375
count is 500000, cost time is 7.93895959854126, memory usage is 0.9923629760742188
count is 600000, cost time is 12.54422402381897, memory usage is 1.1193618774414062
count is 700000, cost time is 9.631025552749634, memory usage is 1.1592445373535156
count is 800000, cost time is 7.459330081939697, memory usage is 1.1680374145507812
count is 900000, cost time is 9.528786659240723, memory usage is 1.2807159423828125
count is 1000000, cost time is 7.681959867477417, memory usage is 1.2977333068847656
count is 1100000, cost time is 7.3086090087890625, memory usage is 1.3396949768066406
count is 1200000, cost time is 11.282068252563477, memory usage is 1.4544296264648438
count is 1300000, cost time is 9.21155858039856, memory usage is 1.4788284301757812
count is 1400000, cost time is 10.056787014007568, memory usage is 1.5263175964355469
count is 1500000, cost time is 12.212023973464966, memory usage is 1.6380157470703125
count is 1600000, cost time is 14.238991260528564, memory usage is 1.69512939453125
count is 1700000, cost time is 8.800130128860474, memory usage is 1.7134437561035156
count is 1800000, cost time is 11.374922275543213, memory usage is 1.8270645141601562
count is 1900000, cost time is 8.9767906665802, memory usage is 1.8472061157226562
count is 2000000, cost time is 8.989881992340088, memory usage is 1.8804588317871094
count is 2100000, cost time is 11.93136477470398, memory usage is 2.000476837158203
count is 2200000, cost time is 11.224282264709473, memory usage is 2.016876220703125
count is 2300000, cost time is 13.535449266433716, memory usage is 2.0394668579101562
count is 2400000, cost time is 12.848443269729614, memory usage is 2.1280059814453125
count is 2500000, cost time is 12.208937883377075, memory usage is 2.138408660888672
count is 2600000, cost time is 16.975553512573242, memory usage is 2.2880821228027344
count is 2700000, cost time is 19.275086879730225, memory usage is 2.287738800048828
count is 2800000, cost time is 11.386988639831543, memory usage is 2.3098106384277344
count is 2900000, cost time is 13.70014500617981, memory usage is 2.3990440368652344
count is 3000000, cost time is 10.45867395401001, memory usage is 2.420604705810547
count is 3100000, cost time is 10.75408387184143, memory usage is 2.4437637329101562
count is 3200000, cost time is 15.346243619918823, memory usage is 2.5608978271484375
count is 3300000, cost time is 12.275937795639038, memory usage is 2.5855789184570312
count is 3400000, cost time is 11.398426532745361, memory usage is 2.6102142333984375
count is 3500000, cost time is 17.990268230438232, memory usage is 2.7031402587890625
count is 3600000, cost time is 11.90847396850586, memory usage is 2.724163055419922
count is 3700000, cost time is 14.961709260940552, memory usage is 2.8711891174316406
count is 3800000, cost time is 13.13991904258728, memory usage is 2.8688430786132812
count is 3900000, cost time is 12.900552749633789, memory usage is 2.8935928344726562
count is 4000000, cost time is 15.278205633163452, memory usage is 3.01715087890625
count is 4100000, cost time is 12.421746492385864, memory usage is 3.044261932373047
count is 4200000, cost time is 12.715410232543945, memory usage is 3.1170883178710938
count is 4300000, cost time is 15.297654867172241, memory usage is 3.195178985595703
count is 4400000, cost time is 11.920997858047485, memory usage is 3.2213592529296875
count is 4500000, cost time is 12.397282123565674, memory usage is 3.2494659423828125
count is 4600000, cost time is 13.162795305252075, memory usage is 3.3564605712890625
count is 4700000, cost time is 14.042455434799194, memory usage is 3.413494110107422
count is 4800000, cost time is 10.402931451797485, memory usage is 3.3945388793945312
count is 4900000, cost time is 13.326395034790039, memory usage is 3.4888954162597656
count is 5000000, cost time is 11.762998580932617, memory usage is 3.5169677734375
count is 5100000, cost time is 13.566682577133179, memory usage is 3.610504150390625
count is 5200000, cost time is 11.697095155715942, memory usage is 3.637969970703125
count is 5300000, cost time is 11.785945415496826, memory usage is 3.702167510986328
count is 5400000, cost time is 20.747815132141113, memory usage is 3.7620506286621094
count is 5500000, cost time is 12.001267910003662, memory usage is 3.788776397705078
count is 5600000, cost time is 12.201840877532959, memory usage is 3.8513031005859375
count is 5700000, cost time is 16.82955837249756, memory usage is 3.9653396606445312
count is 5800000, cost time is 12.35794973373413, memory usage is 3.9715538024902344
count is 5900000, cost time is 12.41870403289795, memory usage is 3.999217987060547
count is 6000000, cost time is 14.590713024139404, memory usage is 4.0941619873046875
count is 6100000, cost time is 13.40040898323059, memory usage is 4.119499206542969
count is 6200000, cost time is 15.54603385925293, memory usage is 4.2159881591796875
count is 6300000, cost time is 12.232314348220825, memory usage is 4.2417449951171875
count is 6400000, cost time is 12.939337491989136, memory usage is 4.268760681152344
count is 6500000, cost time is 15.472190856933594, memory usage is 4.371849060058594
count is 6600000, cost time is 13.525130987167358, memory usage is 4.392463684082031
count is 6700000, cost time is 13.798184633255005, memory usage is 4.467185974121094
count is 6800000, cost time is 16.133020877838135, memory usage is 4.513973236083984
count is 6900000, cost time is 20.654539108276367, memory usage is 4.537406921386719
count is 7000000, cost time is 15.181331872940063, memory usage is 4.617683410644531
count is 7100000, cost time is 16.90074348449707, memory usage is 4.6607208251953125
count is 7200000, cost time is 15.26277780532837, memory usage is 4.6886749267578125
count is 7300000, cost time is 13.590909719467163, memory usage is 4.7701873779296875
count is 7400000, cost time is 17.623094081878662, memory usage is 4.812957763671875
count is 7500000, cost time is 14.904731035232544, memory usage is 4.8453521728515625
count is 7600000, cost time is 16.52383327484131, memory usage is 4.992897033691406
count is 7700000, cost time is 14.730050325393677, memory usage is 4.961498260498047
count is 7800000, cost time is 14.83224892616272, memory usage is 4.986965179443359
count is 7900000, cost time is 16.819100856781006, memory usage is 5.141094207763672
count is 8000000, cost time is 16.299737691879272, memory usage is 5.108722686767578
count is 8100000, cost time is 15.587513208389282, memory usage is 5.14031982421875
count is 8200000, cost time is 19.151288747787476, memory usage is 5.296863555908203
count is 8300000, cost time is 15.674288511276245, memory usage is 5.3394622802734375
count is 8400000, cost time is 16.563526153564453, memory usage is 5.292533874511719
count is 8500000, cost time is 20.42433261871338, memory usage is 5.447917938232422
count is 8600000, cost time is 15.694331884384155, memory usage is 5.412452697753906
count is 8700000, cost time is 20.2867329120636, memory usage is 5.571533203125
count is 8800000, cost time is 18.203043222427368, memory usage is 5.532035827636719
count is 8900000, cost time is 16.625596523284912, memory usage is 5.628833770751953
count is 9000000, cost time is 23.0804705619812, memory usage is 5.652252197265625
count is 9100000, cost time is 17.696472883224487, memory usage is 5.745880126953125
count is 9200000, cost time is 15.72276496887207, memory usage is 5.705802917480469

UPDATE 2:

Code for normalizing the data (small integers and categoricals):

last_time = time.time()
# placeholder column names; downcast counters to int32 and repeated strings to category
dtypes = {"somecount1": "int32", "somecount2": "int32", "somecount3": "int32",
          "somecount4": "int32", "somecount5": "int32",
          "somecate1": "category", "somecate2": "category", "somecate3": "category"}
for i in cursor:
    # drop fields that are not needed in the DataFrame
    del i['something1']
    del i['sometime']
    del i['something2']
    del i['something3']
    del i['someint']
    dataset.append(i)
    del i
    count += 1
    if count % 100000 == 0:
        temp = pd.DataFrame(dataset, columns=dataset[0].keys())
        temp.fillna(0, inplace=True)
        temp = temp.astype(dtypes, errors="ignore")
        dataset = []
        data = pd.concat([data, temp])

Metrics for the optimization:

The total memory usage dropped to almost half of the run above, but the concat time didn't change much.

data length is 37800000,count is 37800000, cost time is 132.23220038414001, memory usage is 11.789329528808594
data length is 37900000,count is 37900000, cost time is 65.34806060791016, memory usage is 11.7882080078125
data length is 38000000,count is 38000000, cost time is 122.15527963638306, memory usage is 11.804153442382812
data length is 38100000,count is 38100000, cost time is 47.79928374290466, memory usage is 11.828723907470703
data length is 38200000,count is 38200000, cost time is 49.70282459259033, memory usage is 11.837543487548828
data length is 38300000,count is 38300000, cost time is 155.42868423461914, memory usage is 11.895767211914062
data length is 38400000,count is 38400000, cost time is 105.94551157951355, memory usage is 11.947330474853516
data length is 38500000,count is 38500000, cost time is 136.1993544101715, memory usage is 12.013351440429688
data length is 38600000,count is 38600000, cost time is 114.5268976688385, memory usage is 12.013912200927734
data length is 38700000,count is 38700000, cost time is 53.31018781661987, memory usage is 12.017452239990234
data length is 38800000,count is 38800000, cost time is 65.94741868972778, memory usage is 12.058589935302734
data length is 38900000,count is 38900000, cost time is 42.62899565696716, memory usage is 12.067787170410156
data length is 39000000,count is 39000000, cost time is 57.95372486114502, memory usage is 11.979434967041016
data length is 39100000,count is 39100000, cost time is 62.12286162376404, memory usage is 12.026973724365234
data length is 39200000,count is 39200000, cost time is 80.76535606384277, memory usage is 12.111717224121094
Tony Wang
  • This approach is called using **chunks**. Better to rename `dataset` to `chunk`, because that's what it is. Anyway, iteratively `append`ing to a DataFrame is known to be very slow and memory-intensive (it involves copying). – smci Mar 10 '19 at 06:45

3 Answers

3

What is in a CSV and what is in a DataFrame are two very different things. For example, 9.9 and 9.99999999999999 take very different amounts of space in a CSV, but exactly the same amount of space (one 64-bit float each) in a DataFrame.

That said, data in a DataFrame takes up much less space than data in a list. Constructing a list is expensive in memory, and appending to a DataFrame requires pandas to make a new (bigger) DataFrame, copy everything, and then leave the original DataFrame to be garbage-collected.

You would probably do much better if you preallocated a dataframe of 60000 rows (or however many rows you have in total); e.g.:

import numpy as np
import pandas as pd

# preallocate the full table once, with one field per column (placeholder names/dtypes)
data = pd.DataFrame(np.empty((60000,), dtype=[
    ('x', np.uint8),
    ('y', np.float64)
]))

and then, for each row, insert that row's data directly, without relying on the dataset list:

data.values[count,:] = rowdata_at_count

This is not type-safe, but it is pretty fast (since no allocations are occurring), so make sure rowdata_at_count is a list whose elements correspond to the column types.
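As a variant of that idea, here is a minimal sketch that fills a preallocated NumPy structured array inside the loop and converts it to a DataFrame once at the end; the field names x and y, their dtypes, and the row count are placeholders rather than the real schema, and cursor is the MongoDB cursor from the question:

import numpy as np
import pandas as pd

n_rows = 60000  # or however many documents the cursor will return

# preallocate the whole buffer once; nothing is reallocated per row
buffer = np.empty((n_rows,), dtype=[
    ('x', np.uint8),     # placeholder fields, replace with the real columns
    ('y', np.float64),
])

for count, doc in enumerate(cursor):
    # field order must match the dtype declaration above
    buffer[count] = (doc['x'], doc['y'])

# a single conversion at the end instead of thousands of appends/concats
data = pd.DataFrame(buffer)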

EDIT

concat is [more] efficient than append

Yes, appending 100 lines is, I believe, like 100 concats of one line (as each append must reallocate and copy the table, just like concat). Preallocating avoids both append and concat: the table size does not change, so no reallocation or copying needs to be done.
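For completeness, the usual way to get the benefit of concat without growing the table inside the loop is to collect the per-chunk DataFrames in a list and concatenate once at the end; a sketch along the lines of the question's chunked loop:

import pandas as pd

chunks = []      # per-chunk DataFrames, concatenated only once at the end
dataset = []
for count, doc in enumerate(cursor, start=1):
    dataset.append(doc)
    if count % 100000 == 0:
        chunks.append(pd.DataFrame(dataset))
        dataset = []

if dataset:      # flush the last, partially filled chunk
    chunks.append(pd.DataFrame(dataset))

data = pd.concat(chunks, ignore_index=True)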

Amadan
  • Thanks for your detailed answer. I know the difference between a CSV and a DataFrame, so I posted metrics for both: the CSV file is 7.1G, and the DataFrame is 6G. I think that gap is reasonable for such a large dataset. But the process's memory usage is 28.7G, which is weird. What kind of process or data consumed the remaining ~21G of memory? Thanks for your advice about appending data to the DataFrame; I will run some tests. – Tony Wang Dec 11 '17 at 12:16
2

This problem is solved very efficiently by HDF5 and PyTables.

1. Define a description:

from tables import IsDescription, StringCol, Int32Col, Time64Col, open_file

class Description(IsDescription):
    something1 = StringCol(30)
    somecount1 = Int32Col()
    somecount2 = Int32Col()
    something2 = StringCol(10)
    somecount3 = Int32Col()
    something3 = StringCol(20)
    somecount4 = Int32Col()
    somecount5 = Int32Col()
    something4 = StringCol(29)
    sometime = Time64Col()

2. Generate an HDF5 file with PyTables:

h5file = open_file("filename.h5", mode='w', title = "title_of_filename")
group = h5file.create_group("/", 'groupname', 'somethingelse')
table = h5file.create_table(group, 'readout', Description, "Readout example")
particle = table.row

3. Loop over the cursor and insert the data into the table:

for i in cursor:
    try:
        particle['something1'] = i['something1']
        ...
        particle['sometime'] = i['sometime']
        particle.append()
        count += 1
        if count % 100000 == 0:
            current_time = time.time()
            cost_time = current_time - last_time
            last_time = current_time
            memory_usage = psutil.virtual_memory().used / (1024**3)
            print("count is {}, cost time is {}, memory usage is {}".format(count, cost_time, memory_usage))
    except Exception as e:
        print(e)
        print(i)
        break

# make sure buffered rows reach disk before closing
table.flush()
h5file.close()

4. By now, all the data from MongoDB is stored in an HDF5 file on local disk; the final .h5 file is 4.6G. Metrics of writing the data into the .h5 file:

count is 100000, cost time is 61.384639501571655, memory usage is 0.6333351135253906
count is 200000, cost time is 1.8020610809326172, memory usage is 0.6135673522949219
count is 300000, cost time is 2.348151206970215, memory usage is 0.6422805786132812
count is 400000, cost time is 1.768083095550537, memory usage is 0.6340789794921875
count is 500000, cost time is 1.7722208499908447, memory usage is 0.6187820434570312
count is 600000, cost time is 2.418192148208618, memory usage is 0.6522865295410156
count is 700000, cost time is 1.8863332271575928, memory usage is 0.6428298950195312
count is 800000, cost time is 1.8162147998809814, memory usage is 0.6209907531738281
count is 900000, cost time is 2.2260451316833496, memory usage is 0.6533966064453125
count is 1000000, cost time is 1.644845962524414, memory usage is 0.6412544250488281
count is 1100000, cost time is 1.7015583515167236, memory usage is 0.6193504333496094
count is 1200000, cost time is 2.2118935585021973, memory usage is 0.6539993286132812
count is 1300000, cost time is 1.732633352279663, memory usage is 0.6396903991699219
count is 1400000, cost time is 1.7652947902679443, memory usage is 0.6167755126953125
count is 1500000, cost time is 2.49992299079895, memory usage is 0.6546707153320312
count is 1600000, cost time is 1.9869158267974854, memory usage is 0.6390419006347656
count is 1700000, cost time is 1.8238599300384521, memory usage is 0.6159439086914062
count is 1800000, cost time is 2.2168307304382324, memory usage is 0.6554222106933594
count is 1900000, cost time is 1.7153246402740479, memory usage is 0.6401138305664062

5. The last thing you need to do is load the HDF5 file into a pandas DataFrame with pd.DataFrame.from_records(table.read()). The total memory usage is 10G and the time cost is less than half an hour. A single laptop (MBP, 16G of memory, running in a Docker container) can handle this much more efficiently.
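A sketch of that last step, reusing the file, group, and table names from step 2 (adjust them to whatever was actually created):

import pandas as pd
from tables import open_file

h5file = open_file("filename.h5", mode="r")
table = h5file.root.groupname.readout        # group and table names from step 2

# table.read() returns a NumPy structured array with one field per column of
# the Description; from_records turns it into a DataFrame in one shot.
data = pd.DataFrame.from_records(table.read())
h5file.close()

# Note: StringCol fields come back as bytes and may need decoding, e.g.
# data['something1'] = data['something1'].str.decode('utf-8')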

Tony Wang
0

If you saved your data to CSV, you can later use read_csv with the parameter memory_map=True.
You might also select a subset of columns from the start instead of dropping them later, and read columns in only when you actually need them.
You might convert repeated textual/categorical data to dummies or integer codes instead.
You might also want to use NumPy directly if you can get a table with a single data type throughout.
Together with sparse matrices, this can significantly reduce your in-memory size and speed up loading and processing.
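A sketch combining a few of these suggestions on the CSV route (the column names and dtypes are placeholders):

import pandas as pd

data = pd.read_csv(
    "localdisk.csv",
    memory_map=True,                       # map the file instead of buffering it
    usecols=["somecount1", "somecate1"],   # read only the columns you need
    dtype={"somecount1": "int32",          # small integers instead of int64
           "somecate1": "category"},       # categories for repeated text values
)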

For merge-type operations, the pandas docs describe merge as "high performance" compared to concat and the like (append predates these).
You might also want to use inplace=True parameters to avoid the overhead of copies.

Ando Jurai