
I am pulling data from the OpenCalais API. Here are the details:

Input: A paragraph as a string, e.g. "Barack Obama is the President of United States." The API returns entity instances with offsets and lengths, but not necessarily in order of occurrence.

Output (what I want): The same string, with each identified entity instance wrapped in a hyperlink (so the result is also a string), i.e.

output = '<a href="https://en.wikipedia.org/Barack_Obama">Barack Obama</a> is the President of <a href="https://en.wikipedia.org/United_States">United States</a>.'

But it is really a Python question.

This is what I have:

#API CALLS ABOVE WHICH IS NOT RELEVANT. 

output=input
for x in range(0,result.print_entities()):
    print len(result.entities[x]["instances"])
    previdx=0
    idx=0
    for y in range(0,len(result.entities[x]["instances"])):

        try: 
            url= "https://permid.org/1-" + result.entities[x]['resolutions'][0]['permid']

        except:
            url="https://en.wikipedia.org/wiki/"+result.entities[x]["name"].replace(" ", "_")

        print "Generating wiki page link"
        print url+"\n"

        #THE PROBLEM STARTS HERE

        offsetstr=result.entities[x]["instances"][y]["offset"]
        lenstr=result.entities[x]["instances"][y]["length"]

        output=output[:offsetstr]+"<a href=" + url + ">" + output[offsetstr:offsetstr+lenstr] + "</a>" + output[offsetstr+lenstr:]

print output

The issue is that after the first iteration the output string has changed: the inserted tag makes it longer, so the offsets from the API (which refer to the original input) no longer point at the right characters in later iterations. As a result, I cannot make the expected change.
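To make the drift concrete, here is a minimal sketch that does not call the API. The `wrap` helper and the hard-coded offset and length are made up for illustration; only the slicing pattern mirrors the code above.

```python
text = "Barack Obama is the President of United States."


def wrap(s, offset, length, url):
    """Wrap s[offset:offset+length] in an <a> tag (hypothetical helper)."""
    end = offset + length
    return s[:offset] + '<a href="' + url + '">' + s[offset:end] + "</a>" + s[end:]


# In the original text, "United States" sits at offset 33, length 13.
assert text[33:46] == "United States"

# Link "Barack Obama" first. Everything after it is pushed to the right.
once = wrap(text, 0, 12, "https://en.wikipedia.org/wiki/Barack_Obama")
shift = len(once) - len(text)  # number of characters the tag added

stale = once[33:46]                  # what the old offset now points at
fresh = once[33 + shift:46 + shift]  # where "United States" actually moved to

print(repr(stale))
print(repr(fresh))
```

The old offset now lands inside the URL of the first link, which is why the string gets garbled as soon as a second entity is processed.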

Basically, I am trying to get from:

input = "Barack Obama is the President of United States."

to:

output = '<a href="https://en.wikipedia.org/Barack_Obama">Barack Obama</a> is the President of <a href="https://en.wikipedia.org/United_States">United States</a>.'

How can this be done? I tried slicing and dicing, but the string just gets garbled.

2 Answers


Try using another variable to store the result:

output=input
res,preOffsetstr  = [],0
for x in range(0,result.print_entities()):
    print len(result.entities[x]["instances"])
    previdx=0
    idx=0
    for y in range(0,len(result.entities[x]["instances"])):

        try: 
            url= "https://permid.org/1-" + result.entities[x]['resolutions'][0]['permid']

        except:
            url="https://en.wikipedia.org/wiki/"+result.entities[x]["name"].replace(" ", "_")

        print "Generating wiki page link"
        print url+"\n"

        #THE PROBLEM STARTS HERE

        offsetstr=result.entities[x]["instances"][y]["offset"]
        lenstr=result.entities[x]["instances"][y]["length"]

        res.append(output[preOffsetstr:offsetstr]+"<a href=" + url + ">" + output[offsetstr:offsetstr+lenstr] + "</a>" + output[offsetstr+lenstr:])

        preOffsetstr = offsetstr
print '\n'.join(res)
galaxyan
  • Thanks, but I did try that and it's still garbled. It WOULD HAVE worked if the returned offsets were in ascending order, but they are not (the order is random). – Adnan Firoze Nov 18 '15 at 02:22
  • So using the solution: Input = "Barack Obama goes to Starbucks." Output="Barack Obama went to Starbucks. Barack Obama went to Starbucks." It gets more mixed up as the input gets larger. @galaxian – Adnan Firoze Nov 18 '15 at 02:31
  • @AdnanFiroze I don't know the structure of your input. Why can't you convert the input to a string? – galaxyan Nov 18 '15 at 02:34
  • The input is a simple string, a couple of sentences. It gets garbled because the tags (words to replace) come back from the API with their offsets in no particular order. For example, if "Starbucks" (Offset=22, Length=9) is returned and linked first (rather than "Barack Obama", Offset=1, Length=12), the result gets messed up. See the problem? And I do not intend to replace every occurrence of "Barack Obama" or whatever, just the ones identified by the returned offset and length. Thanks for your time. Appreciate it. – Adnan Firoze Nov 18 '15 at 02:46
  • Maybe a solution would be to store the {offset, length} tuples in an array, sort it on the offset values, and THEN run the loop. Any help making that structure? @galaxyan – Adnan Firoze Nov 18 '15 at 02:50

I finally solved it. It took some careful index arithmetic, but the intuition from my last comment ("Maybe a solution would be to store the {offset, length} tuples in an array, sort it on the offset values, and THEN run the loop") DID THE TRICK.

output=input
l=[]
for x in range(0,result.print_entities()):
    print len(result.entities[x]["instances"])

    for y in range(0,len(result.entities[x]["instances"])):

        try: 
            url=r'"'+ "https://permid.org/1-" + result.entities[x]['resolutions'][0]['permid'] + r'"'

        except:
            url=r'"'+"https://en.wikipedia.org/wiki/"+result.entities[x]["name"].replace(" ", "_") + r'"'

        print "Generating wiki page link"

 #THE PROBLEM WAS HERE 

        offsetstr=result.entities[x]["instances"][y]["offset"]
        lenstr=result.entities[x]["instances"][y]["length"]

#The KEY TO THE SOLUTION IS HERE
        l.append((offsetstr,lenstr,url))
       # res.append(output[preOffsetstr:offsetstr]+"<a href=" + url + ">" +      output[offsetstr:offsetstr+lenstr] + "</a>" + output[offsetstr+lenstr:])

print l

def getKey(item):
    return item[0]

l_sorted=sorted(l, key=getKey)


#And then simply run a for loop over the sorted tuples: copy the untouched
#text before each entity, then the entity wrapped in its link, and finally
#whatever follows the last entity. All offsets refer to the original string,
#so nothing shifts.
a=[]
cursor=0
for offsetstr, lenstr, url in l_sorted:
    a.append(output[cursor:offsetstr])
    a.append("<a href=" + url + ">" + output[offsetstr:offsetstr+lenstr] + "</a>")
    cursor=offsetstr+lenstr
a.append(output[cursor:])

print "".join(a)

And voilà! :) Thanks for the help, folks. Hope it helps someone someday.
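For anyone who wants to try the sort-then-single-pass idea without OpenCalais credentials, here is a self-contained sketch. The function name `link_entities` and the sample spans are my own assumptions for illustration, not part of the OpenCalais client. It assumes 0-based offsets into the original text and spans that do not overlap.

```python
def link_entities(text, spans):
    """Return text with each (offset, length, url) span wrapped in an <a> tag.

    Assumes offsets are 0-based, refer to the original text, and that
    spans do not overlap. No HTML escaping is done in this sketch.
    """
    pieces = []
    cursor = 0  # end of the last span copied so far
    for offset, length, url in sorted(spans, key=lambda span: span[0]):
        pieces.append(text[cursor:offset])  # untouched text before the entity
        entity = text[offset:offset + length]
        pieces.append('<a href="{}">{}</a>'.format(url, entity))
        cursor = offset + length
    pieces.append(text[cursor:])  # whatever follows the last entity
    return "".join(pieces)


text = "Barack Obama is the President of United States."

# Hypothetical spans, deliberately out of order like the API results.
spans = [
    (33, 13, "https://en.wikipedia.org/wiki/United_States"),
    (0, 12, "https://en.wikipedia.org/wiki/Barack_Obama"),
]

linked = link_entities(text, spans)
print(linked)
```

Because every slice is taken from the original string, no offset ever needs adjusting. Applying the replacements in place in descending offset order would also work, since an insertion never moves the text to its left.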