
I have an xlsx file with over 1000 columns of data. I would like to first parse every second column from the data file (the columns can contain numbers and letters) and then create a unique list from the parsed data.

I'm a complete noob and have tried a "for" and a "do while" loop, but neither has worked for me.

So far I have:

    import pandas as pd
    workbook = pd.read_excel('C:\Python27\Scripts\Data.xlsx')
    worksheet = workbook.sheetname='Data'

    for col in range(worksheet[0], worksheet[1300]):
        print(col)

I think I need to append the data and maybe write it to a text file, then create a unique list from the text file. I can do the second part; it's getting the data into the text file that I'm having trouble with.

Thanks

D. Jenkins
  • Sorry when you say every second column are you meaning just 2nd column or every even column? also are you wanting some global list or a list per column? Also have you looked at [`unique`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html#pandas.Series.unique)? – EdChum Sep 16 '16 at 14:26
  • Thanks Ed, it is every even column - 0,2,4,6,8, etc, or A, C, E, G, I all the way to AWK. There are 1286 columns in total, so I would like to make a list out of 643 of those and then create a unique list from that list. I'll definitely use pandas.series.unique() but I'm unsure how to get the 643 columns into one list. – D. Jenkins Sep 16 '16 at 14:48

1 Answer


You can select every second column by slicing with a step argument, i.e. `df.iloc[:, ::2]` (`.ix` is deprecated and has been removed in later pandas versions).

In [35]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':1, 'b':[1,2,3,4,5], 'c':[2,3,4,5,6], 'd':0, 'e':np.random.randn(5)})
df

Out[35]:
   a  b  c  d         e
0  1  1  2  0 -0.352310
1  1  2  3  0  1.189140
2  1  3  4  0 -1.470507
3  1  4  5  0  0.742709
4  1  5  6  0 -2.798007

here we step every 2nd column:

In [37]:
df.iloc[:, ::2]

Out[37]:
   a  c         e
0  1  2 -0.352310
1  1  3  1.189140
2  1  4 -1.470507
3  1  5  0.742709
4  1  6 -2.798007

We can then call `np.unique` on the sliced frame to get a single sorted array of all the unique values:

In [36]:
np.unique(df.iloc[:, ::2])

Out[36]:
array([-2.79800676, -1.47050675, -0.35231005,  0.74270934,  1.        ,
        1.18914011,  2.        ,  3.        ,  4.        ,  5.        ,  6.        ])
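Since the OP's columns mix numbers and letters, note that `np.unique` sorts its result and under Python 3 cannot compare strings with numbers; `pd.unique` avoids the sort and keeps order of first appearance. A minimal sketch with a stand-in frame (the column names and values here are invented, not from the OP's file):

```python
import pandas as pd

# Stand-in for the spreadsheet: 6 columns, mixed numbers and letters.
df = pd.DataFrame({
    'a': [1, 1, 1, 1, 1],
    'b': [1, 2, 3, 4, 5],
    'c': [2, 3, 4, 5, 6],
    'd': [0, 0, 0, 0, 0],
    'e': ['x', 'y', 'x', 'z', 'y'],
    'f': [7, 8, 9, 8, 7],
})

# Every second column by position (0, 2, 4, ...), then one flat
# unique list; pd.unique preserves order of first appearance.
every_second = df.iloc[:, ::2]
unique_values = pd.unique(every_second.to_numpy().ravel())
print(list(unique_values))  # → [1, 2, 'x', 3, 'y', 4, 5, 'z', 6]
```

For the real file, the stand-in frame would be replaced by `pd.read_excel(r'C:\Python27\Scripts\Data.xlsx', sheet_name='Data')`.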
EdChum
  • Very nice. In general, when calling `numpy` functions, it's not necessary to call `as_matrix`? I vaguely recall that `sklearn` is upset if I don't do this. – Ami Tavory Sep 16 '16 at 15:37
  • @AmiTavory I thought sklearn was much better these days with respect to handling dfs and series as data args, but quite often I just passed `df.values`, I think `np` plays pretty nicely with dataframes, historically there have been some cases where `np` is not calling `__array__` attribute correctly but recently not encountered it often – EdChum Sep 16 '16 at 15:40
  • Thanks guys, I'm nearly there. I still seem to be getting all of the data back when I print `df.iloc[:, ::2]`. The `np.unique()` should still work though, is that right? – D. Jenkins Sep 16 '16 at 16:17
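On the last comment: `iloc[:, ::2]` keeps every row, so the printout is still long; only the column count halves. `np.unique` then flattens whatever it is given, so it works regardless of how many rows remain. A quick check with a toy all-numeric frame (not the OP's data):

```python
import numpy as np
import pandas as pd

# 4 rows x 5 columns: the values 0..19 laid out row by row.
df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('abcde'))

half = df.iloc[:, ::2]  # columns a, c, e; all 4 rows kept
print(half.shape)       # (4, 3)
print(np.unique(half))  # sorted uniques from just those columns
```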