
I am trying to remove sentences from a corpus that are too long (>25 tokens) or too short (<4 tokens), and also remove sentences that contain rare words (words that appear fewer than 8 times). Every attempt gives me either an error message or an empty list. The corpus is the Brown corpus.

lens = [w for w in corpus.sents() if len(w)>=25 and len(w)<= 4]

I get empty list as output

out: []

I am also not sure how to include the rare-word condition in this list comprehension. Do I have to convert the corpus into a FreqDist?

How can I remove sentences that are very long, very short, or contain rare words? Can anyone explain how to do it? It would be much appreciated :)

jay.andrea
  • `len(w)>=25 and len(w)<= 4` - you can't have a sentence whose length is less than 4 _and_ greater than 25 _at the same time_. – ForceBru Mar 07 '21 at 12:42
  • I think you meant something like `lens = [w for w in corpus.sents() if 4 <= len(w) <= 25]` – yudhiesh Mar 07 '21 at 12:45
  • @ForceBru oh okay, how do I do it then? Separately? Also, how do I include rare words that appear less than 8 times? – jay.andrea Mar 07 '21 at 13:11
  • @yudhiesh is it `4 <= len(w) <= 25` or `4 > len(w) > 25`? I get an empty list again for the second one. – jay.andrea Mar 07 '21 at 13:13
  • @jay.andrea `4 > len(w) > 25` requires a length that is both less than 4 and greater than 25, which is not possible. `4 <= len(w) <= 25` keeps the sentences whose length is at least 4 and at most 25, which is what you are looking for. – yudhiesh Mar 07 '21 at 14:43
  • @yudhiesh oohh I got it now. :) Thanks. But how can I also handle rare words that appear less than 8 times? With `rare = [i for i in lens if i <= 10]` I got this error: `'<=' not supported between instances of 'list' and 'int'`, which makes sense, since each item in "lens" is a list. How can I put it all together? Any suggestions? – jay.andrea Mar 07 '21 at 15:09
  • I added two methods that work; please accept and upvote the answer. For removing sentences with rare words you should open another question. I can try to answer it. – yudhiesh Mar 07 '21 at 15:18

1 Answer


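You can do it like this, keeping only the items whose length is at least 4 and at most 25. The example uses plain strings, so `len` counts characters; with `corpus.sents()` each sentence is a list of tokens, so the same condition counts tokens.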

a = ["hello world", "how are you doing","where are you going?", "welcome to the greatest show on earth! How will you manage to gain all the experience needed for this to show?","hi"]
[len(w) for w in a]
>>>[11, 17, 20, 110,2]

Method 1:

list(filter(lambda x: 4 <= len(x) <= 25, a))
>>>['hello world', 'how are you doing', 'where are you going?']

Method 2:

[x for x in a if 4 <= len(x) <= 25]
>>>['hello world', 'how are you doing', 'where are you going?']
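
For the rare-word part of the question, here is a rough sketch of one way to combine both conditions on tokenized sentences. It assumes "rare" means a token that occurs fewer than `min_count` times in the whole corpus, and that a sentence is dropped if it contains any such token. The function name `filter_sentences`, its parameters, and the toy data are made up for illustration. It uses `collections.Counter` so it runs without downloading anything; `nltk.FreqDist` is a `Counter` subclass and should work the same way, with `corpus.sents()` passed in as `sents`.

```python
from collections import Counter

def filter_sentences(sents, min_len=4, max_len=25, min_count=8):
    """Keep sentences of min_len..max_len tokens that contain no rare token.

    Hypothetical helper: a token is treated as rare when it occurs fewer
    than min_count times across all sentences (counts are case-sensitive).
    """
    sents = [list(s) for s in sents]  # materialize so we can iterate twice
    freq = Counter(tok for s in sents for tok in s)
    return [
        s for s in sents
        if min_len <= len(s) <= max_len
        and all(freq[tok] >= min_count for tok in s)
    ]

# Toy data standing in for corpus.sents(); min_count is lowered to 2
# only because the toy corpus is tiny.
toy = [
    ["the", "cat", "sat", "down"],
    ["the", "cat", "sat", "down"],
    ["the", "cat"],                             # too short
    ["the", "cat", "sat", "down", "quietly"],   # "quietly" occurs once
    ["the"] * 30,                               # too long
]
kept = filter_sentences(toy, min_count=2)
print(kept)
```

On the real corpus you would leave `min_count` at 8. Whether to lowercase tokens before counting is a separate choice that this sketch does not make for you.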
yudhiesh