
I have the dataset shown below, and I am trying to grab each name in the feature column where the importance column is not equal to 0.000000 and put them into a list I can use straight away. I have tried a few methods, but the two that show the most promise are as follows:

Method 1

new_features = []

for i in importance_ranking['importance']:
    if i > 0.000000:
        new_features.append(i)
        
new_features

Method 1 just grabs all of the values of the importance column, but I want the feature column values instead, so I tried method 2.

Method 2

features_to_use = []
for x,y in importance_ranking:
    if y > 0.000000:
        features_to_use.append(x)
        
features_to_use

Method 2 throws the following error:

Method 2 error

    ValueError                                Traceback (most recent call last)
<ipython-input-1181-d1ec4f141ff9> in <module>()
      1 features_to_use = []
----> 2 for x,y in importance_ranking:
      3     if y > 0.000000:
      4         features_to_use.append(x)
      5 

ValueError: too many values to unpack (expected 2)

Any help is greatly appreciated.

Method 3 and error

features_to_use = []
for s,x,y in importance_ranking:
    if y > 0.000000:
        features_to_use.append(x)

features_to_use
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1182-8ed92369130e> in <module>()
      1 features_to_use = []
----> 2 for s,x,y in importance_ranking:
      3     if y > 0.000000:
      4         features_to_use.append(x)
      5 

ValueError: too many values to unpack (expected 3)
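For reference, iterating over a DataFrame directly yields its column labels (here the strings 'feature' and 'importance'), not its rows, which is why no number of loop variables ever lines up. A minimal sketch with made-up data standing in for the real importance_ranking:

```python
import pandas as pd

# Toy stand-in for the real importance_ranking DataFrame
importance_ranking = pd.DataFrame({
    'feature': ['src_bytes', 'count', 'is_host_login'],
    'importance': [0.541433, 0.160338, 0.0],
})

# Iterating a DataFrame yields its column labels, so
# "for x, y in importance_ranking" tries to unpack the string
# 'feature' into two names and fails.
print(list(importance_ranking))  # ['feature', 'importance']

# To iterate rows instead, use itertuples() (or iterrows()):
features_to_use = []
for row in importance_ranking.itertuples():
    if row.importance > 0.0:
        features_to_use.append(row.feature)
print(features_to_use)  # ['src_bytes', 'count']
```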

Dataset

    feature    importance
1   src_bytes   0.541433
18  count   0.160338
30  dst_host_diff_srv_rate  0.074743
53  service_bgp 0.066960
31  dst_host_same_src_port_rate 0.045040
28  dst_host_srv_count  0.027176
9   num_compromised 0.016684
25  diff_srv_rate   0.008991
58  service_pm_dump 0.008533
62  service_auth    0.008270
29  dst_host_same_srv_rate  0.006760
2   dst_bytes   0.005153
33  dst_host_serror_rate    0.004642
6   hot 0.003985
32  dst_host_srv_diff_host_rate 0.003330
35  dst_host_rerror_rate    0.002923
34  dst_host_srv_serror_rate    0.002222
87  service_klogin  0.002135
116 flag_SH 0.001553
0   duration    0.001263
7   num_failed_logins   0.001125
22  rerror_rate 0.001011
27  dst_host_count  0.000917
4   wrong_fragment  0.000736
52  service_ntp_u   0.000489
37  flag_RSTOS0 0.000468
3   land    0.000449
111 service_tftp_u  0.000355
19  srv_count   0.000289
8   logged_in   0.000284
... ... ...
16  is_host_login   0.000000
40  service_Z39_50  0.000000
41  service_http_443    0.000000
43  service_other   0.000000
44  protocol_type_tcp   0.000000
45  service_link    0.000000
46  service_X11 0.000000
47  service_exec    0.000000
48  service_red_i   0.000000
49  service_http_2784   0.000000

Line used to create Dataframe

importance_ranking = pd.DataFrame({'feature':all_cols, 'importance':dt.feature_importances_})
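For context, a toy reproduction of that construction (all_cols and the importance values below are stand-ins; the real dt is presumably a fitted tree model whose feature_importances_ supplies the numbers):

```python
import pandas as pd

# Stand-ins for the real all_cols and dt.feature_importances_
all_cols = ['src_bytes', 'count', 'is_host_login']
feature_importances_ = [0.541433, 0.160338, 0.0]

# Same two-column shape as the dataset shown above
importance_ranking = pd.DataFrame({'feature': all_cols,
                                   'importance': feature_importances_})
print(importance_ranking.shape)  # (3, 2)
```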

pic of dataframe

(screenshot of the dataframe omitted; it matches the dataset shown above)

new_test

#features_to_use = []
a,b = importance_ranking[0]
#for s,x,y in importance_ranking:
 #   if y > 0.000000:
     #   features_to_use.append(x)
#
#features_to_use


KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-1244-5d9e2e614219> in <module>()
      1 #features_to_use = []
----> 2 a,b = importance_ranking[0]
      3 #for s,x,y in importance_ranking:
      4  #   if y > 0.000000:
      5      #   features_to_use.append(x)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

~\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
ipmev12
  • Is that first column with numbers a part of the file? Looks like you have 3 values in each row, so unpacking into two would throw an error. – kabanus Apr 28 '18 at 10:13
  • Possible duplicate of ["Too many values to unpack" Exception](https://stackoverflow.com/questions/1479776/too-many-values-to-unpack-exception) – kabanus Apr 28 '18 at 10:14
  • @kabanus Yes, I think it is. I have tried using method 2 with 3 values in the for statement, but it still doesn't work. – ipmev12 Apr 28 '18 at 10:16
  • Then post that attempt and the error as well. In particular, you only want the last two columns, so something like `_,a,b` looks like it would work. – kabanus Apr 28 '18 at 10:16
  • @kabanus I've added the new attempt as method 3. – ipmev12 Apr 28 '18 at 10:20
  • Looks like the line is still something you're not expecting. Clearly there are more than 3 items in the line. I suggest you print the first line to debug this and see what is actually contained in `importance_ranking`. I'm not sure if you are using a `dict` or a `dataframe`, but check the shape of your data. – kabanus Apr 28 '18 at 10:23
  • @kabanus I've added the line of code used to create the dataframe and a pic of the dataframe. – ipmev12 Apr 28 '18 at 10:33
  • Well, that is strange. Looks good. What does `importance_ranking[0]` right before the offending loop show you? Try unpacking that before the loop (e.g. `a,b = importance_ranking[0]`), and if that works, print `x,y` in the loop and see if it's a specific offending row somehow (no idea how). – kabanus Apr 28 '18 at 10:36
  • @kabanus Didn't work, I'm afraid; I've added the test error to the end of the post. – ipmev12 Apr 28 '18 at 10:48
  • My bad, I didn't notice it was `pandas` - you can remove that, it's irrelevant. Print the first line. My point is, try to see what you are iterating - maybe don't unpack, just `for a in ...` - and print in the loop to see what you're getting. You have the best chance of finding the error yourself, I'm afraid. – kabanus Apr 28 '18 at 10:50

2 Answers

I think the best idea is to use boolean indexing:

df = importance_ranking[importance_ranking['importance'] > 0.0]

and then get the features as a list (note the column is named feature, not features):

features = df['feature'].tolist()
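A self-contained run of this boolean-indexing approach, with toy data standing in for the real importance_ranking:

```python
import pandas as pd

# Toy stand-in for the real importance_ranking DataFrame
importance_ranking = pd.DataFrame({
    'feature': ['src_bytes', 'count', 'is_host_login'],
    'importance': [0.541433, 0.160338, 0.0],
})

# Keep only rows with non-zero importance, then take the feature column
df = importance_ranking[importance_ranking['importance'] > 0.0]
features = df['feature'].tolist()
print(features)  # ['src_bytes', 'count']
```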
Mehdi

DataFrames offer a great way to select the data you want:

features_to_use = importance_ranking[importance_ranking['importance'] > 0.0]['feature'].values.tolist()

It may be difficult to understand at first sight, but what it actually does is filter importance_ranking down to the rows whose importance is greater than 0.0, and then select the feature column of the rows that satisfy this condition (you want the feature names, not the importance values). The rest of the line, .values.tolist(), just unpacks the result into a plain Python list.

If you feel uncomfortable with this one-liner, you can do it step by step:

df = importance_ranking[importance_ranking['importance'] > 0.0]  # filtered DataFrame
feature_names = df['feature']  # Series object
features_to_use = feature_names.values.tolist()
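The same steps as a complete run, using a toy stand-in for importance_ranking (the real one is built from dt.feature_importances_); the final select is on the feature column, since the question asks for the feature names:

```python
import pandas as pd

# Toy stand-in for the real importance_ranking DataFrame
importance_ranking = pd.DataFrame({
    'feature': ['src_bytes', 'count', 'is_host_login', 'service_X11'],
    'importance': [0.541433, 0.160338, 0.0, 0.0],
})

df = importance_ranking[importance_ranking['importance'] > 0.0]  # filtered DataFrame
feature_names = df['feature']  # Series object
features_to_use = feature_names.values.tolist()  # plain Python list
print(features_to_use)  # ['src_bytes', 'count']
```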