
I have a few sentences, like:

the film was nice.  
leonardo is great. 
it was academy award.

Now I want them to be tagged with some standard tags, which may look like:

the DT film NN was AV nice ADJ
leonardo NN is AV great ADJ
it PRP was AV academy NN award NN

I could do that, but my goal is to see it as:

[[('the','DT'),('film', 'NN'),('was','AV'),('nice','ADJ')],[('leonardo','NN'),('is','AV'),('great','ADJ')],[('it','PRP'),
('was','AV'),('academy','NN'),('award','NN')]]

that is, a list of lists, where each inner list holds the (word, tag) tuples of one sentence. I can produce one flat list of tuples, but not the tuples of each sentence grouped within one list of lists. I wrote the code below:

def entity_tag():
    # Text to tag: one whitespace-separated stream of words.
    a1 = open("/python27/EntityString.txt", "r")
    a2 = a1.read().lower().split()
    print "The Original String in List form for Comparison:", a2
    # Dictionary file: alternating word, tag tokens.
    a3 = open("/python27/EntityDict1.txt", "r")
    a4 = a3.read().split()
    list1 = []
    list2 = []
    for word in a2:
        if word in a4:
            # The tag follows the word in the dictionary list.
            tag = a4[a4.index(word) + 1]
        else:
            tag = "NA"
        list1.append(word + " " + tag + "$")
    print list1
    string1 = " ".join(list1)
    print string1
    stringw = string1.split("$")
    print stringw
    for subword in stringw:
        subword1 = subword.split()
        if subword1:  # skip the empty piece left after the final "$"
            list2.append(tuple(subword1))
    print "The Tagged Word in list:", list2

As it is PoS tagging, I am not able to use zip here. If anyone may please suggest.

I am using Python2.7.11 on MS-Windows 10.

Coeus2016
  • Surely sir. Editing the question. – Coeus2016 Feb 29 '16 at 17:08
  • Possible duplicate of [Iterating over every two elements in a list](http://stackoverflow.com/questions/5389507/iterating-over-every-two-elements-in-a-list) – alvas Feb 29 '16 at 17:09
  • Perhaps this is not duplicate, I am looking slightly different-each sentence tagged and inserted into list and each list of each sentence forming list of lists. – Coeus2016 Feb 29 '16 at 17:18

2 Answers


See https://stackoverflow.com/a/5394908/610569

>>> x = "the DT film NN was AV nice ADJ leonardo NN is AV great ADJ it PRP was AV academy NN award NN".split()
>>> zip(x,x[1:])[::2]
[('the', 'DT'), ('film', 'NN'), ('was', 'AV'), ('nice', 'ADJ'), ('leonardo', 'NN'), ('is', 'AV'), ('great', 'ADJ'), ('it', 'PRP'), ('was', 'AV'), ('academy', 'NN'), ('award', 'NN')]
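A note on this trick (not part of the original answer): in Python 3, `zip` returns a lazy iterator rather than a list, so the `[::2]` slice needs a `list(...)` wrapper first. A minimal sketch:

```python
# Pair each token with the token that follows it by zipping the list with
# itself shifted by one, then keep every second pair to get (word, tag).
x = "the DT film NN was AV nice ADJ".split()
pairs = list(zip(x, x[1:]))[::2]  # Python 3: materialize zip before slicing
# pairs == [('the', 'DT'), ('film', 'NN'), ('was', 'AV'), ('nice', 'ADJ')]
```

In Python 2.7, where `zip` already returns a list, `zip(x, x[1:])[::2]` works as shown above.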
alvas
  • Thank you for your effort but it is giving [('t', 'h'), ('e', ' '), ('D', 'T'), (' ', 'f'), ('i', 'l'), ('m', ' '), ('N', 'N'), (' ', 'w'), ('a', 's'), (' ', 'A'), ('V', ' '), ('n', 'i'), ('c', 'e'), (' ', 'A'), ('D', 'J'), (' ', 'l'), ('e', 'o'), ('n', 'a'), ('r', 'd'), ('o', ' '), ('N', 'N'), (' ', 'i'), ('s', ' '), ('A', 'V'), (' ', 'g'), ('r', 'e'), ('a', 't'), (' ', 'A'), ('D', 'J'), (' ', 'i'), ('t', ' '), ('P', 'R'), ('P', ' '), ('w', 'a'), ('s', ' '), ('A', 'V'), (' ', 'a'), ('c', 'a'), ('d', 'e'), ('m', 'y'), (' ', 'N'), ('N', ' '), ('a', 'w'), ('a', 'r'), ('d', ' '), ('N', 'N')] – Coeus2016 Feb 29 '16 at 18:48
  • Read the example carefully. – alvas Feb 29 '16 at 19:50
  • Please see >>> reader = TaggedCorpusReader('.', r'.*\.pos') >>> reader.words() [u"[('They',", u"'PRP'),", u"('refuse',", u"'VBP'),", ...] >>> reader.tagged_sents() may it be a solution? – Coeus2016 Feb 29 '16 at 19:51
  • Read http://stackoverflow.com/a/5394908/610569 carefully. Understand it, don't just copy and paste. – alvas Feb 29 '16 at 19:56

If your tagged strings look like this, as you wrote:

the DT film NN was AV nice ADJ
leonardo NN is AV great ADJ
it PRP was AV academy NN award NN

then you can do:

[zip(*[iter(line.split())] * 2) for line in lines]

where `lines` is an iterable of sentences.

Output:

[[('the', 'DT'), ('film', 'NN'), ('was', 'AV'), ('nice', 'ADJ')],
 [('leonardo', 'NN'), ('is', 'AV'), ('great', 'ADJ')], 
 [('it', 'PRP'), ('was', 'AV'), ('academy', 'NN'), ('award', 'NN')]]
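The key idiom here is the "grouper" trick: `[iter(line.split())] * 2` repeats the *same* iterator twice, so `zip` pulls tokens from it alternately and yields non-overlapping (word, tag) pairs. A self-contained sketch using the question's sentences (wrapped in `list(...)` so it also works in Python 3, where `zip` is lazy):

```python
# Repeating one iterator twice inside zip() consumes the token stream
# in non-overlapping pairs: (word, tag), (word, tag), ...
lines = [
    "the DT film NN was AV nice ADJ",
    "leonardo NN is AV great ADJ",
    "it PRP was AV academy NN award NN",
]
tagged = [list(zip(*[iter(line.split())] * 2)) for line in lines]
# tagged[0] == [('the', 'DT'), ('film', 'NN'), ('was', 'AV'), ('nice', 'ADJ')]
```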


If you have one single string of words to tag, and after tagging nothing tells you where one sentence ends and the next begins, you can no longer recover the sentence boundaries, except by assuming a fixed sentence length that acts as an implicit separator. So you need to decide how your code will handle the string (or the list of words, or the final list of (word, tag) pairs) so that the sentences stay separated: you need a separator.
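A sketch of this point, assuming a newline is chosen as the separator: with the newline present, sentence boundaries survive the round trip; once the text is flattened to a single token stream, only a flat list of pairs is recoverable.

```python
# With a separator ("\n") the sentence structure can be reconstructed;
# from the flat token stream alone the boundaries are lost.
text = "the DT film NN was AV nice ADJ\nleonardo NN is AV great ADJ"

per_sentence = [list(zip(*[iter(s.split())] * 2)) for s in text.split("\n")]
# per_sentence has 2 inner lists, one per sentence

flat = list(zip(*[iter(text.split())] * 2))
# flat is one list of 7 pairs; nothing marks where sentence 1 ends
```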

  • Thank you for your time. But it is giving slight different output as,>>> lines="the DT film NN was AV nice ADJ leonardo NN is AV great ADJ it PRP was AV academy NN award NN" >>> [zip(*[iter(line.split())] * 2) for line in lines] [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []] – Coeus2016 Feb 29 '16 at 18:42
  • Yeah, if you have one big string... In that case, you must know the length of each sentence to generate a list of lists of tuples. Before you updated the post, I had supposed you read each sentence from a file with readline() or from an iterator, and that you already had structured data. – Nicholas Roveda Feb 29 '16 at 20:43
  • If you use readline(), you can split('\n') into lines and then use the code above; otherwise you can iterate directly over your iterable. If you mean to do it with one unique string, without knowing the length of each sentence, you can't, but you can still get a flat list of pairs (word, tag). Try all the code we suggested, explain better what you need, and post some output, so we can help you. – Nicholas Roveda Feb 29 '16 at 20:54
  • Your code is producing output; how do I integrate it: for line in a1: line1=line.lower().split() for word in line1: if word in a4: windex=a4.index(word) windex1=windex+1 word1=a4[windex1] word2=word+" "+word1 list1.append(word2) elif word not in a4: word3=word+" "+"NA" list1.append(word3) else: print "None" string1=" ".join(list1) string2=string1.split("\n") tag_list=[zip(*[iter(line.split())] * 2) for line in string2] – Coeus2016 Mar 01 '16 at 19:59
  • @SubhabrataBanerjee What are you trying to achieve exactly? If I read your code correctly, you have two files: one that contains a series of words separated by whitespace, and one dictionary with a sequence of word space tag. You read both files to get a list of words to tag and a list of word, tag, word, tag. For each word you check if there's a copy of it in the dictionary and then you retrieve the tag (index + 1), so that at the end you get your list of word space tag $. At that point you split the list again, and for each pair word, tag you make a tuple to insert into a single list. – Nicholas Roveda Mar 01 '16 at 20:20
  • @SubhabrataBanerjee Why do you create the list and then split it again to make a new list of tuples? If you only want to store the pairs word, tag, you can put them directly in a dict, or append them as tuples to your list in one step. If you want an output like the one you posted, you have to read each sentence from the file with readlines, if they are separated by a newline or carriage return, or split on a separator you set beforehand, or, as a last resort, you have to know the length of each sentence. – Nicholas Roveda Mar 01 '16 at 20:26
  • @SubhabrataBanerjee Otherwise, you can't obtain a list of lists of tuples from one big string, only a flat list of tuples. **If you want to keep each sentence separated, you need a separator**, even a logical one, like information about the length of each sentence, or a word that marks the end of each one. If you need more info, ask, and please clean up your code a bit: use meaningful variable names and provide some comments; this will help you a lot. – Nicholas Roveda Mar 01 '16 at 20:31
  • I am trying to achieve the output your example gives, but my input is a file with a lot of sentences, a sort of corpus. I am trying to train on a tagged corpus in NLTK. Presently I achieve it with brown.tagged_sents() and TaggedCorpusReader from nltk.corpus.reader, but I am looking for a plain Python workout. I am working on it. Let's see. – Coeus2016 Mar 02 '16 at 14:19
  • @SubhabrataBanerjee A lot of sentences? Please provide an exact copy of an input file... – Nicholas Roveda Mar 03 '16 at 20:36