
I have an xlsx file with over 1000 columns of data. I would like to first parse every second column from the data file (the columns can contain numbers and letters) and then create a unique list from the parsed data.

I'm a complete noob and have tried a "for" and a "do while" loop, but neither has worked for me.

So far I have:

    import pandas as pd
    workbook = pd.read_excel('C:\Python27\Scripts\Data.xlsx')
    worksheet = workbook.sheetname='Data'

    for col in range(worksheet[0], worksheet[1300]):
        print(col)

I think I need to append the data and maybe write it to a text file, then create a unique list from the text file. I can do the second part; it's getting the data into the text file that I'm having trouble with.

Thanks

D. Jenkins
  • Sorry when you say every second column are you meaning just 2nd column or every even column? also are you wanting some global list or a list per column? Also have you looked at [`unique`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html#pandas.Series.unique)? – EdChum Sep 16 '16 at 14:26
  • Thanks Ed, it is every even column - 0,2,4,6,8, etc, or A, C, E, G, I all the way to AWK. There are 1286 columns in total, so I would like to make a list out of 643 of those and then create a unique list from that list. I'll definitely use pandas.series.unique() but I'm unsure how to get the 643 columns into one list. – D. Jenkins Sep 16 '16 at 14:48

1 Answer


You can select every second column by slicing with a step argument, i.e. `df.iloc[:, ::2]` (`.ix` is deprecated and has been removed in later pandas versions).

In [35]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':1, 'b':[1,2,3,4,5], 'c':[2,3,4,5,6], 'd':0, 'e':np.random.randn(5)})
df

Out[35]:
   a  b  c  d         e
0  1  1  2  0 -0.352310
1  1  2  3  0  1.189140
2  1  3  4  0 -1.470507
3  1  4  5  0  0.742709
4  1  5  6  0 -2.798007

here we step every 2nd column:

In [37]:
df.iloc[:, ::2]

Out[37]:
   a  c         e
0  1  2 -0.352310
1  1  3  1.189140
2  1  4 -1.470507
3  1  5  0.742709
4  1  6 -2.798007

We can then call `np.unique` on the sliced frame to get a single sorted array of all the unique values:

In [36]:
np.unique(df.iloc[:, ::2])

Out[36]:
array([-2.79800676, -1.47050675, -0.35231005,  0.74270934,  1.        ,
        1.18914011,  2.        ,  3.        ,  4.        ,  5.        ,  6.        ])
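Since the OP's columns mix numbers and letters, note that `np.unique` sorts its result and under Python 3 cannot compare strings with numbers; `pd.unique` avoids the sort and keeps order of first appearance. A minimal sketch with a stand-in frame (the column names and values here are invented, not from the OP's file):

```python
import pandas as pd

# Stand-in for the spreadsheet: 6 columns, mixed numbers and letters.
df = pd.DataFrame({
    'a': [1, 1, 1, 1, 1],
    'b': [1, 2, 3, 4, 5],
    'c': [2, 3, 4, 5, 6],
    'd': [0, 0, 0, 0, 0],
    'e': ['x', 'y', 'x', 'z', 'y'],
    'f': [7, 8, 9, 8, 7],
})

# Every second column by position (0, 2, 4, ...), then one flat
# unique list; pd.unique preserves order of first appearance.
every_second = df.iloc[:, ::2]
unique_values = pd.unique(every_second.to_numpy().ravel())
print(list(unique_values))  # → [1, 2, 'x', 3, 'y', 4, 5, 'z', 6]
```

For the real file, the stand-in frame would be replaced by `pd.read_excel(r'C:\Python27\Scripts\Data.xlsx', sheet_name='Data')`.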
EdChum
  • Very nice. In general, when calling `numpy` functions, it's not necessary to call `as_matrix`? I vaguely recall that `sklearn` is upset if I don't do this. – Ami Tavory Sep 16 '16 at 15:37
  • @AmiTavory I thought sklearn was much better these days with respect to handling dfs and series as data args, but quite often I just passed `df.values`, I think `np` plays pretty nicely with dataframes, historically there have been some cases where `np` is not calling `__array__` attribute correctly but recently not encountered it often – EdChum Sep 16 '16 at 15:40
  • Thanks guys, I'm nearly there. I still seem to be getting all of the data back when I print `df.iloc[:, ::2]`. The `np.unique()` should still work though, is that right? – D. Jenkins Sep 16 '16 at 16:17
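On the last comment: `iloc[:, ::2]` keeps every row, so the printout is still long; only the column count halves. `np.unique` then flattens whatever it is given, so it works regardless of how many rows remain. A quick check with a toy all-numeric frame (not the OP's data):

```python
import numpy as np
import pandas as pd

# 4 rows x 5 columns: the values 0..19 laid out row by row.
df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('abcde'))

half = df.iloc[:, ::2]  # columns a, c, e; all 4 rows kept
print(half.shape)       # (4, 3)
print(np.unique(half))  # sorted uniques from just those columns
```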