I have a very large knowledge graph in pandas dataframe format as follows.

This dataframe KG has more than 100 million rows.

KG:

                   pred     subj      obj
        0   nationality     BART      USA
        1  placeOfBirth     BART  NEWYORK
        2     locatedIn  NEWYORK      USA
      ...           ...      ...      ...
116390740     hasFather     BART   HOMMER
116390741   nationality   HOMMER      USA
116390743  placeOfBirth   HOMMER  NEWYORK

I want to get the rows of this KG that have specific values for subj.

Treating the subj column as a Series, I built a boolean mask with the isin() function and used it to index the KG, as shown below.

KG[KG['subj'].isin(['BART', 'NEWYORK'])]

My desired output is

                   pred     subj      obj
        0   nationality     BART      USA
        1  placeOfBirth     BART  NEWYORK
        2     locatedIn  NEWYORK      USA
116390740     hasFather     BART   HOMMER

I have to repeat the above lookup, but this method takes a long time. Is there a more efficient way to do this?

thanks!

Won chul Shin
    Does this answer your question? [A faster alternative to Pandas `isin` function](https://stackoverflow.com/questions/23945493/a-faster-alternative-to-pandas-isin-function) – dm2 May 09 '21 at 09:52

1 Answer

You can set the index to subj, sort it, and then pick the required values with .loc. Looking up rows by index label is generally faster than filtering on a column's values, and faster still when the index is sorted. (Here df is your KG dataframe.)

df = df.set_index('subj')
df = df.sort_index()
result = df.loc[['BART', 'NEWYORK']] 
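As a minimal, self-contained sketch of this approach: the small frame below is made up from the rows shown in the question and only stands in for the real 100-million-row KG, and the name `indexed` is just illustrative. The idea is to pay the set/sort cost once and reuse the sorted index for every repeated lookup.

```python
import pandas as pd

# Toy stand-in for the real KG, built from the rows shown in the question.
KG = pd.DataFrame(
    {
        "pred": ["nationality", "placeOfBirth", "locatedIn",
                 "hasFather", "nationality", "placeOfBirth"],
        "subj": ["BART", "BART", "NEWYORK", "BART", "HOMMER", "HOMMER"],
        "obj": ["USA", "NEWYORK", "USA", "HOMMER", "USA", "NEWYORK"],
    }
)

# Build and sort the index once (this is the expensive step).
indexed = KG.set_index("subj").sort_index()

# Reuse the sorted index for each lookup.
result = indexed.loc[["BART", "NEWYORK"]]

# Turn 'subj' back into a regular column if the original layout is needed.
result = result.reset_index()
print(result)
```

Two things to keep in mind: set_index replaces the original integer row labels, so keep a copy of them first (for example with reset_index) if you need them, and .loc with a list raises a KeyError in recent pandas versions when one of the requested labels is not in the index.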

To filter on both subj and obj, you can set a MultiIndex on the two columns and then use query:

df = df.set_index(['subj','obj'])
df = df.sort_index()
df.query("subj in ['BART','NEWYORK'] & obj in ['USA','HOMMER']")
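Here is a small runnable sketch of the MultiIndex variant, again on made-up sample rows taken from the question (the names `indexed`, `by_query` and `by_mask` are only illustrative). It also builds the plain boolean-mask version so the two results can be compared; whether query is actually faster on your data is something you would need to time yourself.

```python
import pandas as pd

# Toy stand-in for the real KG, built from the rows shown in the question.
KG = pd.DataFrame(
    {
        "pred": ["nationality", "placeOfBirth", "locatedIn",
                 "hasFather", "nationality", "placeOfBirth"],
        "subj": ["BART", "BART", "NEWYORK", "BART", "HOMMER", "HOMMER"],
        "obj": ["USA", "NEWYORK", "USA", "HOMMER", "USA", "NEWYORK"],
    }
)

# Build and sort the two-level index once.
indexed = KG.set_index(["subj", "obj"]).sort_index()

# query() can refer to index level names as if they were columns.
by_query = indexed.query(
    "subj in ['BART', 'NEWYORK'] and obj in ['USA', 'HOMMER']"
)

# Equivalent boolean-mask filter on the original frame, for comparison.
mask = KG["subj"].isin(["BART", "NEWYORK"]) & KG["obj"].isin(["USA", "HOMMER"])
by_mask = KG[mask]

print(by_query)
print(by_mask)
```

Both selections keep the same triples; they differ only in row order and in whether subj and obj are index levels or columns.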
Nk03
  • Is there any other way to deal with the following conditions? `KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]` – Won chul Shin May 09 '21 at 11:33