Is there a smart list/dictionary comprehension way of getting the intended output below, given the following?

import numpy as np
freq_mat = np.random.randint(2, size=(4, 5))
tokens = ['a', 'b', 'c', 'd', 'e']
labels = ['X', 'S', 'Y', 'S']

The intended output for freq_mat

array([[1, 0, 0, 1, 1],
       [0, 0, 0, 0, 1],
       [1, 0, 1, 1, 0],
       [0, 1, 0, 0, 0]])

should look like the following:

[({'a': True, 'b': False, 'c': False, 'd': True, 'e': True}, 'X'),
 ({'a': False, 'b': False, 'c': False, 'd': False, 'e': True}, 'S'),
 ({'a': True, 'b': False, 'c': True, 'd': True, 'e': False}, 'Y'),
 ({'a': False, 'b': True, 'c': False, 'd': False, 'e': False}, 'S')]
Kam
  • Have a look here [https://stackoverflow.com/a/1747827/7352806] – Narendra Feb 14 '18 at 04:23
  • Could you explain what it is you're trying to do? – cs95 Feb 14 '18 at 04:29
  • Something is odd with your original code: you're setting `d[key] = val>0` repeatedly for the same `key` but different `val`. This is either not doing what you want or it's wasting a lot of work. What do you expect `featureset` to look like? – Nathan Vērzemnieks Feb 14 '18 at 05:08

2 Answers

1

You can collapse that code to:

Code:

featureset = [
    ({key: val > 0 for val in row for key in tokens}, label)
    for row, label in zip(freq_mat, labels)]

Test Code:

import numpy as np

freq_mat = np.random.randint(2, size=(4, 5))
tokens = ['a', 'b', 'c', 'd', 'e']
labels = ['X', 'S', 'Y', 'S']

featureset2 = []
for row, label in zip(freq_mat, labels):
    d = dict()
    for key in tokens:
        for val in row:
            d[key] = val > 0
    featureset2.append((d, label))

featureset = [
    ({key: val > 0 for val in row for key in tokens}, label)
    for row, label in zip(freq_mat, labels)]

assert featureset == featureset2
Stephen Rauch
  • Like the original code, this does a lot of unnecessary work: `{key:val > 0 for val in row for key in tokens}` is exactly equivalent to `{key:row[-1]>0 for key in tokens}` (if `row` is non-empty, of course). I suspect there's some confusion on the poster's side. – Nathan Vērzemnieks Feb 14 '18 at 05:11
  • Ah, so I was quite close. Thanks for your help Stephen, this works for me. @Nathan, I have added the intended output to the question. Basically, I am working on BoW feature construction from a text feature, stored in a Pandas dataframe. The above should allow me to use the output from CountVectorizer to construct featuresets for NLTK classifiers. This probably is a bit convoluted way of doing it but several other solutions I looked at seemed a bit involved and didn't work for me. Do you have a better suggestion? – Kam Feb 14 '18 at 21:27
  • My apologies friends. You are right Nathan, my code is not doing what it is supposed to do. I have edited my original post to remove confusion. Sorry again for the confusion. @Stephen, I will mark your answer correct if you could kindly amend your response for the revised question. – Kam Feb 14 '18 at 21:48
0

As you note in your updated post, your original code doesn't work quite right: it assigns the same value to every key in a given row (all True or all False), because the inner loop overwrites each key with every value in the row and only the last one survives. The simplest correction to your original code would be this:

featureset = []
for row, label in zip(freq_mat, labels):
    d = dict()
    for key, val in zip(tokens, row):  # The critical bit
        d[key] = val > 0
    featureset.append((d, label))
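
To see the difference concretely, here is a minimal sketch using the second row of the matrix from the question. A plain Python list stands in for the NumPy row (an assumption, so the snippet runs without NumPy), and the names `nested` and `paired` are purely illustrative:

```python
tokens = ['a', 'b', 'c', 'd', 'e']
row = [0, 0, 0, 0, 1]  # second row of the example matrix, as a plain list

# Nested loops: each key is overwritten by every value in the row,
# so only the last value (row[-1]) survives for all keys.
nested = {}
for key in tokens:
    for val in row:
        nested[key] = val > 0

# zip: the i-th token is paired with the i-th value only.
paired = {}
for key, val in zip(tokens, row):
    paired[key] = val > 0

print(nested)
print(paired)
```

Only 'e' is set in that row, yet the nested loops report every token as True, because the last value in the row is 1.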

A more streamlined version, but one that's still quite a bit more readable, I think, than the single-comprehension approach:

featureset = []
for row, label in zip(freq_mat, labels):
    d = {key: val > 0 for key, val in zip(tokens, row)}
    featureset.append((d, label))

Or for the one-liner:

featureset = [
    ({key: val > 0 for key, val in zip(tokens, row)}, label)
    for row, label in zip(freq_mat, labels)]

Personally I'd probably go with the second approach, a compromise between concision and readability. But that's up to you, of course!
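
As a quick sanity check, the one-liner can be run against the fixed matrix and intended output from the question. This is only a sketch: nested lists stand in for the NumPy array (an assumption, to keep it dependency-free), and `expected` is simply the intended output copied from the question:

```python
# The example matrix from the question, as plain nested lists.
freq_mat = [[1, 0, 0, 1, 1],
            [0, 0, 0, 0, 1],
            [1, 0, 1, 1, 0],
            [0, 1, 0, 0, 0]]
tokens = ['a', 'b', 'c', 'd', 'e']
labels = ['X', 'S', 'Y', 'S']

# One (feature dict, label) tuple per row; zip pairs each token with its count.
featureset = [
    ({key: val > 0 for key, val in zip(tokens, row)}, label)
    for row, label in zip(freq_mat, labels)]

# Intended output, copied from the question.
expected = [
    ({'a': True, 'b': False, 'c': False, 'd': True, 'e': True}, 'X'),
    ({'a': False, 'b': False, 'c': False, 'd': False, 'e': True}, 'S'),
    ({'a': True, 'b': False, 'c': True, 'd': True, 'e': False}, 'Y'),
    ({'a': False, 'b': True, 'c': False, 'd': False, 'e': False}, 'S')]

assert featureset == expected
```

With a real NumPy array, `val > 0` yields `numpy.bool_` values rather than built-in bools. They compare equal to True/False, and wrapping the comparison in `bool(...)` should give plain Python bools if a downstream library needs them.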

Nathan Vērzemnieks
  • Thanks Nathan, I just worked it out with a fresh brain this morning :-). Marking your answer correct with sincere apologies to Stephen above. – Kam Feb 14 '18 at 22:19