
Hi, I am working on a data transformation project. I am taking in a CSV that has 1 million records and trying to segregate them into individual txt files. The problem is that it takes a lot of time to process: we're talking more than 5 minutes for each column here. My code is below:

import pandas as pd

print("Reading CSV")
# address_file_path points at the input CSV (defined elsewhere)
data_set = pd.read_csv(address_file_path, low_memory=False, index_col=1)
print("Reading Completed")
a_name = set(data_set.loc[:, 'A'])
print("A done")
b_name = set(data_set.loc[:, 'B'])
print("B Done")
c_name = set(data_set.loc[:, 'C'])
print("C Done")
d_name = set(data_set.loc[:, 'D'])
print("D done")
e_name = set(data_set.loc[:, 'E'])
print("E done")
f_name = set(data_set.loc[:, 'F'])
print("F done")

print("Data Transformed")

It does A quite quickly: even though the Pandas.Series has 1 million records, the repetition is such that it turns out to be only 36 entries. But then it gets stuck, and I am not even sure the code finishes, since I haven't seen it finish up to now. How can I optimise it to run faster?


Here is the first entry of the dataset (all columns contain strings):

Unnamed: 0                     1
A                            NaN
B               Job Applicant;UP
C                            USA
D                     California
E               Alice neh tratma
F                      Bharuhana
I                            NaN
J                            NaN
K                           SH 5
L                            NaN
M                            NaN
N                            NaN
O                            NaN
P                            NaN
Q                            NaN
iam.Carrot
  • If you are only writing to a file, try `pd.unique`, which is faster than building a set. – ayhan Jan 08 '18 at 16:08
  • @ayhan I just pass in the Pandas.Series to it? – iam.Carrot Jan 08 '18 at 16:09
  • Yeah. Either `pd.unique(df['A'])` or `df['A'].unique()`. You can continue using loc too, of course. – ayhan Jan 08 '18 at 16:11
  • @ayhan but what if I want to perform some operations on it before I write it down? For example, if I want to split some items on ` ; ` and then write them to a file, how do I go about it? – iam.Carrot Jan 08 '18 at 16:13
  • @iam.Carrot take a look at this question, it might help https://stackoverflow.com/questions/48129713/fastest-way-to-find-all-unique-elements-in-an-array-with-cython – Bharath M Shetty Jan 08 '18 at 16:14
  • @DSM I am running a 16 GB RAM system with an Intel Core i5 7th Gen. Memory utilization by PyCharm goes up to 1 GB. – iam.Carrot Jan 08 '18 at 16:15
  • Then something's definitely wrong; the set operation should take only seconds. Please try to create a [mcve]. – DSM Jan 08 '18 at 16:17
  • @DSM that's all there is to the code. I can't share my dataset since it's far too large. I don't understand the Minimal, Complete and Verifiable example. – iam.Carrot Jan 08 '18 at 16:19
  • @DSM I've added the read CSV part just in case it's of any relevance – iam.Carrot Jan 08 '18 at 16:20
  • @iam.Carrot can you add at least `data_set.loc[0, 'A']` so we can see what kind of data we are dealing with? If it's a pandas question, no data is no use. – Bharath M Shetty Jan 08 '18 at 16:20
  • @iam.Carrot: please reread the MCVE page, and follow it up with this Q/A on [good pandas questions](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). We can't help with what we can't reproduce. – DSM Jan 08 '18 at 16:22
  • @DSM I've added the first entry of the dataset. All columns contain text (a bunch of strings) – iam.Carrot Jan 08 '18 at 16:27
  • @Dark I've added the first entry – iam.Carrot Jan 08 '18 at 16:28
  • Thanks, now I see what's going on. – DSM Jan 08 '18 at 16:29
  • @DSM great, thank god it provides more insight, since I literally had nothing else to make it an MCVE. Please share where I am going wrong. – iam.Carrot Jan 08 '18 at 16:32

1 Answer


I think you're hitting this problem:

In [17]: s = pd.Series([np.nan]*3)

In [18]: set(s)
Out[18]: {nan, nan, nan}

In [19]: s[0]
Out[19]: nan

In [20]: s[0] is s[1]
Out[20]: False

nan can be a quirky little beast, because it's not equal to itself, and in this context the returned values aren't even identical to each other. That is the worst case for a hash lookup: a set tests membership by hash, then identity, then equality, and every NaN fails the identity and equality checks while sharing the same hash value. Each insertion therefore probes the ever-growing chain of earlier NaNs, so with hundreds of thousands of NaNs in a column, building the set is quadratic, which is why it never seems to finish.
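If you want to see the blow-up directly, here is a minimal sketch (illustrative only; note that CPython 3.10+ derives NaN's hash from the object's id, which sidesteps the collision chain):

import time

import numpy as np
import pandas as pd

# Doubling the number of NaNs roughly quadruples the time to build the
# set on interpreters where all NaNs share one hash value.
for n in (10_000, 20_000, 40_000):
    s = pd.Series([np.nan] * n)
    start = time.perf_counter()
    set(s)
    print(n, round(time.perf_counter() - start, 3))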

Either drop the nans manually or use .unique(). After

df = pd.DataFrame(np.random.random((10**6,6)))
df.iloc[::2] = np.nan

I get

In [26]: %time z = set(df[0].dropna())
CPU times: user 128 ms, sys: 40 ms, total: 168 ms
Wall time: 174 ms

In [27]: %time z = df[0].unique()
CPU times: user 72 ms, sys: 12 ms, total: 84 ms
Wall time: 88.9 ms

and I gave up waiting for set(df[0]) to finish.

In general, you're going to want to avoid mixing Python-level and numpy/pandas-level functions. The latter tend to be faster even in cases where you're not hitting this weird corner case.
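As a rough illustration on data with no NaNs at all, building a set forces every element to be boxed into a Python object first, while unique() stays inside pandas' C-level hash table (timings omitted here; they vary by machine):

import numpy as np
import pandas as pd

s = pd.Series(np.random.random(10**6))

z = set(s)       # iterates and boxes a million scalars at the Python level
z = s.unique()   # operates on the raw float64 array in compiled code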

In your code, you could do

set(data_set.loc[:, 'A'].unique())

if you want to keep the NaN, or set(data_set.loc[:, 'A'].dropna().unique()) if you don't. For many purposes you wouldn't even need to convert to a set, but if you're more familiar with its methods you can keep it.
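Putting it together with the original goal of one txt file per column, plus the semicolon-splitting asked about in the comments (e.g. "Job Applicant;UP"), a minimal sketch of the whole flow might look like the following; the output file names are hypothetical:

import pandas as pd

data_set = pd.read_csv(address_file_path, low_memory=False, index_col=1)

for col in ['A', 'B', 'C', 'D', 'E', 'F']:
    # dropna() + unique() runs in linear time and keeps NaN out of the output
    values = data_set[col].dropna().unique()
    with open(col + '.txt', 'w') as fh:
        for value in values:
            # split multi-part entries such as "Job Applicant;UP"
            for part in str(value).split(';'):
                fh.write(part.strip() + '\n')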

DSM
  • great! Would you be able to edit a line of code from my question, as I am a little new to the Python environment? It would be of great help. – iam.Carrot Jan 08 '18 at 16:41
  • Also we can do `set(df["0"].astype(str))` since OP has object columns, though it can't beat `unique()`. – Bharath M Shetty Jan 08 '18 at 16:43
  • Great! It works perfectly. Thanks for the help; I shall remember it. Dropping the `NaN` manually wouldn't have been great, as I'd like to keep my code immune to such datasets; `unique()` works perfectly well. I didn't get `.dropna().unique()`, though: what purpose does it serve? – iam.Carrot Jan 08 '18 at 16:48
  • Er, `dropna` drops the nans, otherwise you'll wind up with a NaN in your set and not just strings. – DSM Jan 08 '18 at 16:50