I have the following SQL query for finding overlaps between begin and end for a particular note_id:

select a.*, b.*
from test.analytical_cui_mipacq_concepts_new a
inner join test.analytical_cui_mipacq_concepts_new b on ( 
    ( b.begin>=a.begin and b.begin<=a.end )
    or
    ( b.begin<=a.begin and b.end>=a.begin )
)
where ((a.system='metamap' and  b.system!=a.system) or (a.system='metamap' and  b.system=a.system and a.id_ != b.id_ and a.note_id = b.note_id))

This query takes far too long to run, so I am trying to convert it to a pandas merge by following this thread: pandas-join-dataframe-with-condition

So far I have come up with the following (`new` is my original dataframe, `note_id` is how I identify a particular individual, and `id_` is the primary key from the database table):

a = new.copy()
b = new.copy()
b.columns

b = b.rename(index=str, columns={'end':'end_x', 'begin': 'begin_x', 'cui': 'cui_x', 
                                 'old_cui': 'old_cui_x', 'type': 'type_x', 
                                 'polarity': 'polarity_x', 'id_':'id_x'}) 

c = a.merge(b, how='inner', on=['note_id'])

print(len(a), len(b), len(c))
c.loc[(((c.begin >= c.begin_x) & (c.begin <= c.end_x)) 
       | ((c.begin<=b.begin_x) & (c.end>=c.begin_x))) &
      (((c.system=='metamap') &  (c.system!=c.system_x)) 
       | ((c.system_x=='metamap') & (c.system==c.system_x) 
          & (c.id_ != c.id_x) & (c.note_id == c.note_id_x)))]

When I run this, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-e8c0d060f2a0> in <module>()
     32 print(len(a), len(b), len(c))
     33 c.loc[(((c.begin >= c.begin_x) & (c.begin <= c.end_x)) 
---> 34        | ((c.begin<=b.begin_x) & (c.end>=c.begin_x))) &
     35       (((c.system=='metamap') &  (c.system!=c.system_x)) 
     36        | ((c.system_x=='metamap') & (c.system==c.system_x) 

/anaconda3/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1674 
   1675         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1676             raise ValueError("Can only compare identically-labeled "
   1677                              "Series objects")
   1678 

ValueError: Can only compare identically-labeled Series objects

I am not sure what this means, even after searching around for it.
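For reference, the arrow in the traceback points at the line that compares `c.begin` with `b.begin_x`, i.e. a column of the merged frame against a column of the pre-merge frame `b` (whose index was also converted to strings by `rename(index=str, ...)`). That mismatch is the likely trigger. A minimal sketch of the same error, using two made-up Series named `left` and `right` (they are illustrative, not part of the code above):

```python
import pandas as pd

# Two Series with the same values but different index labels, similar to
# comparing a column of the merged frame `c` with a column of `b`.
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([1, 2, 3], index=["0", "1", "2"])

try:
    mask = left <= right  # labels differ, so pandas refuses to compare
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)

# Comparing Series that share the same index works element by element.
same_index = pd.Series([1, 2, 3], index=[0, 1, 2])
aligned = left <= same_index
```

If this is indeed the cause, every column in the filter would need to come from the same frame (`c`), so that all the boolean Series share one index.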

The data look like:

begin,polarity,end,note_id,type,system,cui,id_
31,1,37,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0004352,1
63,1,71,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0574032,2
81,1,86,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0039869,3
96,1,100,527982345,biomedicus.v2.UmlsConcept,biomedicus,C1123023,4
96,1,105,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0015230,5
101,1,105,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0015230,6
130,1,138,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0574032,7
143,1,144,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0184661,8
156,1,162,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0026591,9
176,1,185,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0004268,10
201,1,209,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0574032,11
101,-1,116,527982345,org.metamap.uima.ts.Candidate,metamap,C0445223,168094
100,-1,116,527982345,org.metamap.uima.ts.Candidate,metamap,C0445223,168095
109,-1,116,527982345,org.metamap.uima.ts.Candidate,metamap,C0445223,168096
124,1,129,527982345,org.metamap.uima.ts.Candidate,metamap,C0205435,168097
124,1,129,527982345,org.metamap.uima.ts.Candidate,metamap,C1279901,168098
130,1,138,527982345,org.metamap.uima.ts.Candidate,metamap,C0574032,168099
130,1,138,527982345,org.metamap.uima.ts.Candidate,metamap,C1827465,168100
143,1,144,527982345,org.metamap.uima.ts.Candidate,metamap,C0021966,168101
143,1,144,527982345,org.metamap.uima.ts.Candidate,metamap,C0221138,168102
31,1,37,527982345,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C0004352,55414
599,1,603,527982345,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C0206655,55415
67,1,73,4069123471-4,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C3263723,55416
646,-1,650,527982345,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C0042109,55417
31,1,37,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,,32496
56,1,71,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,C0993666,32497
92,1,105,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,,32498
96,1,100,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,,32499
120,1,129,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,C2008415,32500
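A possible shape for the overlap search on rows like these is sketched below. It is only an illustration under a few assumptions: it uses a handful of the rows above with some columns dropped, the suffixes `_a` and `_b` are my own choice, and joining on `note_id` restricts both branches of the SQL `where` clause to a single note (the SQL only requires matching `note_id` in the same-system branch). Because `system` is not renamed before the merge, pandas suffixes it on both sides, and `note_id` stays a single column since it is the join key.

```python
import pandas as pd

# A few rows taken from the sample data (begin, end, note_id, system, id_).
rows = [
    (96, 100, "527982345", "biomedicus", 4),
    (130, 138, "527982345", "biomedicus", 7),
    (100, 116, "527982345", "metamap", 168095),
    (130, 138, "527982345", "metamap", 168099),
    (130, 138, "527982345", "metamap", 168100),
    (67, 73, "4069123471-4", "ctakes", 55416),
]
new = pd.DataFrame(rows, columns=["begin", "end", "note_id", "system", "id_"])

# Self-join on note_id; every non-key column gets a suffix on each side.
# Note that "id_" plus the suffix "_a" becomes "id__a".
c = new.merge(new, on="note_id", suffixes=("_a", "_b"))

# Same two interval tests as in the SQL join condition.
overlaps = (
    ((c["begin_b"] >= c["begin_a"]) & (c["begin_b"] <= c["end_a"]))
    | ((c["begin_b"] <= c["begin_a"]) & (c["end_b"] >= c["begin_a"]))
)

# Side a must be metamap; side b is another system or a different row.
pairs = (c["system_a"] == "metamap") & (
    (c["system_b"] != c["system_a"]) | (c["id__a"] != c["id__b"])
)

# All masks are built from columns of c, so they share one index.
result = c.loc[overlaps & pairs, ["note_id", "id__a", "id__b"]]
```

A self-merge on `note_id` materialises every pair of rows within a note before filtering, so memory use grows with the square of the number of annotations per note.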
horcle_buzz
  • That means the Series `a` and `b` have different indexes, and pandas does not define Series comparison in this case. The same error occurs with the test `a = pd.Series([1, 2], index=[0, 1]); b = pd.Series([1, 2], index=[0, 2]); a == b`. Could you post a few lines of example data? – Peter Leimbigler Mar 08 '19 at 02:48
  • Done. I'm basically trying to find overlaps in my `begin` and `end` columns across a single `note_id` instance. – horcle_buzz Mar 08 '19 at 03:11
  • Can you post the data not as an image but as actual text so that we can paste it into our IDEs? Thanks! – gold_cy Mar 08 '19 at 03:17
  • Done. Pasting from excel makes it an image, for some stupid reason. – horcle_buzz Mar 08 '19 at 03:25
  • You should probably sample your data, given that what you provided does not match some of the conditions you specify, such as `system == 'metamap'` – gold_cy Mar 08 '19 at 03:27
  • One more update to data about to happen. – horcle_buzz Mar 08 '19 at 03:38

0 Answers