
I have the following output, which I got by using nltk's .tokenize() and .pos_tag() together with WordNet's .synsets(). The output is a list of lists of potential matches, one list for each token of the document given its WordNet part-of-speech tag (here we have 4 tokens, hence 4 lists of matches):

[[Synset('document.n.01'),
  Synset('document.n.02'),
  Synset('document.n.03'),
  Synset('text_file.n.01'),
  Synset('document.v.01'),
  Synset('document.v.02')],
 [Synset('be.v.01'),
  Synset('be.v.02'),
  Synset('be.v.03'),
  Synset('exist.v.01'),
  Synset('be.v.05'),
  Synset('equal.v.01'),
  Synset('constitute.v.01'),
  Synset('be.v.08'),
  Synset('embody.v.02'),
  Synset('be.v.10'),
  Synset('be.v.11'),
  Synset('be.v.12'),
  Synset('cost.v.01')],
 [Synset('angstrom.n.01'),
  Synset('vitamin_a.n.01'),
  Synset('deoxyadenosine_monophosphate.n.01'),
  Synset('adenine.n.01'),
  Synset('ampere.n.02'),
  Synset('a.n.06'),
  Synset('a.n.07')],
 [Synset('trial.n.02'),
  Synset('test.n.02'),
  Synset('examination.n.02'),
  Synset('test.n.04'),
  Synset('test.n.05'),
  Synset('test.n.06'),
  Synset('test.v.01'),
  Synset('screen.v.01'),
  Synset('quiz.v.01'),
  Synset('test.v.04'),
  Synset('test.v.05'),
  Synset('test.v.06'),
  Synset('test.v.07')]]

I want to write a function (a loop, possibly) that extracts only the first match for each token and returns the results as a new list, such as the following (using the example above):

[Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]

What's the most flexible way to write such a function, so that it can be extended to other tokenized documents (with POS tagging)?

Thank you.

Chris T.

2 Answers


Use a list comprehension.

[token[0] for token in data if token]
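To reuse this on other tokenized documents, the comprehension can be wrapped in a small function. This is only a sketch: the name `first_matches` is hypothetical, and plain strings stand in for the `Synset` objects so it runs without nltk installed.

```python
def first_matches(matches_per_token):
    """Return the first match for each token, skipping tokens with no matches."""
    # `if matches` filters out empty lists, so matches[0] never raises IndexError
    return [matches[0] for matches in matches_per_token if matches]


# Plain strings stand in for Synset objects (an assumption for illustration)
data = [
    ["document.n.01", "document.n.02"],
    [],  # a token with no matches, e.g. when synsets() returned []
    ["be.v.01", "be.v.02"],
]

print(first_matches(data))  # ['document.n.01', 'be.v.01']
```

Note that skipping empty lists means the result can be shorter than the number of tokens. If the output has to stay aligned with the tokens, one option is `[m[0] if m else None for m in data]`, which puts `None` where a token had no match.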
Zach Gates
  • Can I make this into a list? I was running your code under mine, but it returns an error message 'list index out of range' – Chris T. Aug 30 '17 at 14:50
  • @ChrisT.: If `data` is equal to your list of tokens, I'm not sure why you're seeing that error. – Zach Gates Aug 30 '17 at 14:52
  • I think this has something to do with tokens like 'this' in the original text: when I use wn.synsets() to look up synsets for such a token, it returns [ ], and evaluating token[0] on [ ] raises an error. Is there any way to modify the function to skip 'no match' tokens? – Chris T. Aug 30 '17 at 14:55
  • @ChrisT.: There absolutely is. I've added this check into the list comprehension (the `if token` portion). – Zach Gates Aug 30 '17 at 14:56
  • Thanks again for your instant reply. I will check that again. – Chris T. Aug 30 '17 at 14:56
  • This worked. Thanks again for helping with this. I am opening a new question thread as a follow-up to this one (I don't want to make this thread too long). I still seem to get really confused about list comprehensions. You are more than welcome to comment on that one when you have time. – Chris T. Aug 30 '17 at 15:45

Here is a small example of looping over a list of lists like yours; you can try the same with your data.

    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for x in a:
        print(x[0])  # first element of each inner list

Output looks like:

    1
    4
    7
Chetan_Vasudevan
  • I have tried using list.append() on my output, but it returns an error message 'list index out of bound'. But thanks for answering this. – Chris T. Aug 30 '17 at 04:44
  • Why do you want to use list.append()? The question you asked was just to print the output in the required way, right? Did the above not satisfy your requirement? Let me know if you want to do something more with it. – Chetan_Vasudevan Aug 30 '17 at 04:48
  • I used list.append() because I am defining a function that will return a list object. To be sure, print will also do the trick. – Chris T. Aug 30 '17 at 04:55
  • So before the for loop, create an empty list, say result=[], and then instead of print(x[0]) do result.append(x[0]); you can then return result at the end of your function. – Chetan_Vasudevan Aug 30 '17 at 05:00
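
The approach described in the last comment could look roughly like the sketch below. The function name `first_items` is hypothetical, and the `if x` guard is an addition not mentioned in the comment: it skips empty inner lists, which would otherwise raise an IndexError on `x[0]`.

```python
def first_items(list_of_lists):
    """Collect the first element of each non-empty inner list."""
    result = []
    for x in list_of_lists:
        if x:  # guard added here: skip empty inner lists
            result.append(x[0])
    return result


a = [[1, 2, 3], [], [4, 5, 6], [7, 8, 9]]
print(first_items(a))  # [1, 4, 7]
```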