I have a huge dataframe. After a groupby operation, I have a list of strings associated with each element of the first column. I need to be able to quickly find the strings that a particular i'th row has in common with every other row. I could do this in Pandas by saving the grouped dataframe as a pickle file, but that solution was suboptimal because loading takes a very long time.
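
To make the goal concrete, here is a toy sketch of the lookup I am after (data made up; once each group's strings are a set, the overlap is a plain intersection):

groups = {'orange': {'ball', 'bull', 'boy'}, 'apple': {'boy', 'bamboo'}}

def common_with(key):
    # strings that `key`'s group shares with each other group
    return {k: groups[key] & v for k, v in groups.items() if k != key}

print(common_with('orange'))  # {'apple': {'boy'}}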

I then found Polars promising, except that it cannot store a dataframe with a column of sets in any of the formats it supports for quick loading. That leaves the alternative of storing the grouped column as lists and quickly converting it to sets after loading from parquet. (I faced the same problem with datatable and Vaex too.)
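
A rough sketch of that workaround (toy data, illustrative file name; I am assuming the standard Polars parquet round-trip here):

import polars as pl

# list[str] columns round-trip through parquet; sets are rebuilt on load
df = pl.DataFrame({'ColA': ['orange', 'apple'],
                   'ColB': [['ball', 'bull', 'boy'], ['boy', 'bamboo']]})
df.write_parquet('grouped.parquet')
loaded = pl.read_parquet('grouped.parquet')
sets = {a: set(b) for a, b in loaded.rows()}  # ColA -> set of ColB strings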

The Polars solution I found was to use .apply, but it runs in a single thread and is very slow. The code I used was as follows:

>>> df = pl.read_csv('test.csv')
>>> df
shape: (4, 2)
┌────────┬────────┐
│ ColA   ┆ ColB   │
│ ---    ┆ ---    │
│ str    ┆ str    │
╞════════╪════════╡
│ apple  ┆ boy    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ orange ┆ ball   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ apple  ┆ bamboo │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ orange ┆ bull   │
└────────┴────────┘
>>> df = (df.lazy().groupby('ColA').agg([pl.col('ColB').list()])).collect()
>>> df
shape: (2, 2)
┌────────┬───────────────────┐
│ ColA   ┆ ColB              │
│ ---    ┆ ---               │
│ str    ┆ list[str]         │
╞════════╪═══════════════════╡
│ orange ┆ ["ball", "bull"]  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ ["boy", "bamboo"] │
└────────┴───────────────────┘
>>> df['ColB'] = df['ColB'].apply(set)
>>> df
shape: (2, 2)
┌────────┬───────────────────┐
│ ColA   ┆ ColB              │
│ ---    ┆ ---               │
│ str    ┆ object            │
╞════════╪═══════════════════╡
│ orange ┆ {'ball', 'bull'}  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ {'boy', 'bamboo'} │
└────────┴───────────────────┘
>>>
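
The object dtype above also seems to be what blocks saving; a quick check (the exact exception will vary by Polars version):

# once ColB holds Python sets (dtype `object`), parquet writing fails
try:
    df.write_parquet('sets.parquet')
except Exception as exc:
    print(type(exc).__name__, exc)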

I found a discussion on using map, but it operates on whole Series only. Unlike that example, which worked on a per-element basis, when I used np.asarray to convert to NumPy arrays (in order to intersect them later), entire columns collapsed into single arrays!

>>> df = (df.lazy().groupby('ColA').agg([pl.col('ColB').list()])).collect()
>>> df
shape: (2, 2)
┌────────┬─────────────────────────┐
│ ColA   ┆ ColB                    │
│ ---    ┆ ---                     │
│ str    ┆ list[str]               │
╞════════╪═════════════════════════╡
│ orange ┆ ["ball", "bull", "boy"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ ["boy", "bamboo"]       │
└────────┴─────────────────────────┘
>>> df.select([pl.all().map(np.asarray)])
shape: (1, 2)
┌────────────────────┬─────────────────────────────────────┐
│ ColA               ┆ ColB                                │
│ ---                ┆ ---                                 │
│ object             ┆ object                              │
╞════════════════════╪═════════════════════════════════════╡
│ ['orange' 'apple'] ┆ [array(['ball', 'bull', 'boy'], ... │
└────────────────────┴─────────────────────────────────────┘
>>>
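
As far as I understand it, the collapse happens because map hands the entire Series to the function in a single call per column, so np.asarray wraps all rows into one array. A sketch that makes this visible (assuming the same Polars version as above):

def peek(s):
    print(type(s), len(s))  # one whole polars Series per column
    return s

df.select([pl.all().map(peek)])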

I would like to know where I went wrong, and how to use multiple threads (as with map) to convert a column of lists into a column of NumPy arrays (or, preferably, sets).

1 Answer

Perhaps not the best approach, but the following worked reasonably well.

>>> my_dict = dict(df.to_numpy().tolist())
>>> my_dict
{'orange': array(['ball', 'bull', 'boy'], dtype=object), 'apple': array(['boy', 'bamboo'], dtype=object)}
>>> for i in my_dict:
...     my_dict[i] = set(my_dict[i])
...
>>> my_dict
{'orange': {'ball', 'boy', 'bull'}, 'apple': {'bamboo', 'boy'}}
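
With the dict of sets in hand, the original lookup is a plain intersection ('orange' is just an example key):

# strings the 'orange' group shares with every other group
target = my_dict['orange']
print({k: target & v for k, v in my_dict.items() if k != 'orange'})
# {'apple': {'boy'}}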