I have a huge DataFrame with 8 million rows and an algorithm that works on it in a recursive module. To keep the algorithm from loading the whole DataFrame every single time, I would like to store it in HDF format, so that I can preselect only the information I need and save memory. The problem is that one column holds a list in every row, which is why saving leads to the following error:

Exception: cannot find the correct atom type -> [dtype->object,items->Index(['Herkunft', 'ID', 'Objekttyp', 'ObjekttypNr', 'Staat', 'aktueller Name', 'art', 'ldName', 'population', 'neuername', 'neuername2', 'kphdist'], dtype='object')]

This leads to my question: is there a way to store a column of lists in an HDF table? The lists should be usable right after loading, without any costly reformatting.

Edit: here are my columns with 10 entities as lists:

Herkunft ['1Wiki', '1Wiki', '1Wiki', '1Wiki']
ID ['http://www.wikidata.org/entity/Q1917863', 'http://www.wikidata.org/entity/Q7165355', 'http://www.wikidata.org/entity/Q7165354', 'http://www.wikidata.org/entity/Q7165337']
Objekttyp [nan, nan, nan, nan]
ObjekttypNr ['nan', 'Dorf', 'Dorf', 'Dorf']
Staat ['ES', 'MY', 'IN', 'CA']
aktueller Name [nan, nan, nan, nan]
art ['http://www.wikidata.org/entity/Q2074737', 'http://www.wikidata.org/entity/Q486972', 'http://www.wikidata.org/entity/Q486972', 'http://www.wikidata.org/entity/Q486972']
latitude [38.840555555, 1.56667, 13.3667, 44.3008]
ldName ['carrícola', 'penunus', 'penumuru', 'pentz, nova scotia']
longitude [-0.471388888, 111.45, 79.1833, -64.3819]
population ['95', nan, nan, nan]
lakurz [38.840555555, 1.56667, 13.3667, 44.3008]
lokurz [-0.471388888, 111.45, 79.1833, -64.3819]
neuername ['carrícola', 'penunus', 'penumuru', 'pentz , nova scotia']
neuername2 ['carricola', 'penunus', 'penumuru', 'pentz , nova scotia']
lettermass [5, 5, 6, 11]
kphdist [['4745'], ['1668'], ['1667'], ['168', '', '', '63', '82']]
  • Can you include some sample data (like the first two lines) for your dataset? And outline the HDF format as you would like to save it? PS: Have a look at [this posting](https://stackoverflow.com/questions/38460744/how-does-one-store-a-pandas-dataframe-as-an-hdf5-pytables-table-or-carray-earr). – Matthias Apr 02 '18 at 12:05
  • In my opinion, it is inefficient to hold lists in a dataframe series *or* in an HDF5 dataset (even if it were possible). My first port of call would be to see if your data structure can be refactored. – jpp Apr 02 '18 at 12:41
  • @jpp If you can recommend a better way, be my guest. I use the list in a function afterwards for iteration. – Eric Radisch Apr 17 '18 at 15:24
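One possible refactoring along the lines suggested in the comments, offered only as a sketch and not as a confirmed fix for the error above: join each list in `kphdist` into a single delimited string before saving, and split it back into a list only for the rows that were preselected. The helper names `encode_lists`, `decode_lists` and `save_table`, the `|` separator, the `places` key and the two-column miniature frame are all assumptions made for illustration; the separator must never occur inside the list items. The decoding step is still a conversion, but it only runs on the selected rows, not on all 8 million.

```python
import pandas as pd

# Hypothetical miniature of the frame from the question (two columns only).
df = pd.DataFrame({
    "ldName": ["carrícola", "penunus", "penumuru", "pentz, nova scotia"],
    "kphdist": [["4745"], ["1668"], ["1667"], ["168", "", "", "63", "82"]],
})


def encode_lists(frame, column, sep="|"):
    """Return a copy in which every list in `column` is joined into one string."""
    out = frame.copy()
    out[column] = out[column].map(sep.join)
    return out


def decode_lists(frame, column, sep="|"):
    """Return a copy in which every string in `column` is split back into a list."""
    out = frame.copy()
    out[column] = out[column].map(lambda s: s.split(sep))
    return out


def save_table(frame, path, key="places"):
    """Write the encoded frame in table format (requires the optional PyTables package).

    Assumption: all remaining object columns hold plain strings.
    Listing `ldName` in data_columns makes it usable in `where` queries.
    """
    frame.to_hdf(path, key=key, format="table", data_columns=["ldName"])


encoded = encode_lists(df, "kphdist")     # plain string column, no lists left
restored = decode_lists(encoded, "kphdist")  # lists again, e.g. after a read
```

After `save_table(encoded, "places.h5")`, a subset could be loaded with something like `pd.read_hdf("places.h5", "places", where="ldName == 'penunus'")` and passed through `decode_lists`. Note that an empty list would come back as `['']` under this encoding, so that case would need separate handling.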