I've recently been using pandas for data analysis, and I'm trying to be properly pythonic about things. The following code works just fine to find all of the unique values in a certain subset of columns:
import pandas as pd
dataframe = pd.read_csv("sourcefile.csv", na_values=[" ",""])
col_names = list(dataframe)
my_cols = [name for name in col_names if "STRING" in name]
unique_urls = set()
for col in my_cols:
    for url in list(dataframe[col]):
        unique_urls.add(url)
But I feel like there should be a better way to write those last two nested for loops. Any advice appreciated!
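For reference, the tightest pure-Python version I can see is a single set comprehension (a sketch reusing the dataframe and my_cols defined above):

unique_urls = {url for col in my_cols for url in dataframe[col]}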
EDIT: I may have found a better way based on some answers here: Find unique values in a Pandas dataframe, irrespective of row or column location
The following code works:
import pandas as pd
dataframe = pd.read_csv("sourcefile.csv", na_values=[" ",""])
col_names = list(dataframe)
my_cols = [name for name in col_names if "STRING" in name]
unique_urls = pd.unique(dataframe[my_cols].values.ravel())
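As I understand it, .values exposes the selected columns as a 2D NumPy array, .ravel() flattens that to 1D, and pd.unique() returns the distinct entries in order of first appearance. A toy illustration (hypothetical data, not my real sourcefile.csv):

import pandas as pd

# Hypothetical toy frame, just to show what each step does
toy = pd.DataFrame({"STRING_A": ["x", "y", "x"],
                    "STRING_B": ["y", "z", "x"]})

# .values -> 2D array, .ravel() -> flat 1D array (row-major),
# pd.unique() -> distinct entries, first-appearance order
print(pd.unique(toy[["STRING_A", "STRING_B"]].values.ravel()))
# ['x' 'y' 'z']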
I did a time test:
In [8]: def unique_items_1():
            unique_urls = set()
            for col in my_cols:
                for item in list(dataframe[col]):
                    unique_urls.add(item)
In [9]: %timeit unique_items_1()
1000 loops, best of 3: 436 µs per loop
In [10]: %timeit unique_items_2 = pd.unique(dataframe[my_cols].values.ravel())
1000 loops, best of 3: 462 µs per loop
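One caveat I'm aware of: the second %timeit measures a bare expression while the first measures a function call, so the overhead isn't perfectly symmetric. To keep the comparison like-for-like, the pandas one-liner could also be wrapped in a function before timing (a sketch):

def unique_items_2():
    return pd.unique(dataframe[my_cols].values.ravel())

# then time it the same way: %timeit unique_items_2()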
And since both take approximately the same amount of time, with the set() approach being slightly faster, I'm still curious what the experts consider the best way. Thanks!