
I have the following information in a Python sqlite3 database with around 4 million records.

Term         No of articles      Article Ids
Obama           300            [411,523,534, …. 846]
Gandhi         3900            [23,32,33…..4578]
Mandela        3900            [21,14,56,145 …4536]
George Bush     450            [230,310 … 700]
Tony Blair      350            [225,320 … 800]
Justin Bieber   25             [401 , 420, 690 …. 904]
Lionel Messi    150            [23, 78, …… 570]

'Article Ids' is a blob holding the list of ids (as returned by an API)

I have the task of finding the common ids from the id lists for each pair of terms and saving them in 'relationships.db'.

How do I establish relationships where I find which articles talk both about Gandhi & Mandela together (intersecting article ids) ?

The relationships.db should look like this:

Term 1              Term 2          No of Common Article Ids    Common Article IDS
Obama               Gandhi                17                    [34,123,25 ...]
Obama               Mandela               43                    [145,111,234,456 ....]
Obama               George Bush           46
Obama               Tony Blair             2
Obama               Justin Bieber         36
Obama               Lionel Messi           3
Gandhi              Mandela               40
Gandhi              George Bush           41
Gandhi              Tony Blair            32
Gandhi              Justin Bieber         31
Gandhi              Lionel Messi          20
Mandela             George Bush           20
Mandela             Tony Blair            11
Mandela             Justin Bieber         19
Mandela             Lionel Messi          39
George Bush         Tony Blair            46
George Bush         Justin Bieber         49
George Bush         Lionel Messi           2
Tony Blair          Justin Bieber         50
Tony Blair          Lionel Messi           3
Justin Bieber       Lionel Messi           6

Using a 'for loop' to iterate over every pair of terms and compute the intersection is painful. Is there an efficient way to do this? Will there be trade-offs between 'memory' and 'speed'?

richie

2 Answers


This is hinted at by the comments, but it is impossible (or at least very hard) to process the information while it is still in a BLOB. You could make more progress if you restructure your database:

Term      Article_id
Ghandi    33
Obama     411
Obama     523
Ghandi    23
Obama     846
...
Mandela   23

This "flat" representation has a few advantages. First, it makes adding new articles easy: you don't have to unpack the blob. Second, with proper indexing, your original counts can be recovered easily:

SELECT COUNT(*) FROM news WHERE Term="Ghandi"

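At 4 million rows, these per-term lookups (and the self-join below) will likely need indexes to stay fast. A minimal sketch, with illustrative index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE news (Term STRING, Article_id INTEGER);
-- Speeds up lookups like WHERE Term = ...
CREATE INDEX idx_news_term ON news (Term);
-- Speeds up the self-join on Article_id
CREATE INDEX idx_news_article ON news (Article_id);
''')
```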
These can be stored in a separate table if you want. To extract joint articles, you would search for something like:

SELECT A.Article_id, A.Term, B.Term FROM news AS A 
JOIN news AS B ON A.Article_id = B.Article_id 
AND A.Term != B.Term

Note that this will double count, but this can easily be corrected. Here is a complete minimal working example with a modified table to show more matches:

import sqlite3

conn = sqlite3.connect(":memory:")

raw_items = '''
Ghandi    33
Obama     411
Obama     521
Ghandi    23
Obama     21
Ghandi    411
Mandela   21'''

script = '''
CREATE TABLE news ( 
Term STRING,
Article_id INTEGER
);'''

conn.executescript(script)
items = [line.split() for line in raw_items.strip().split('\n')]
conn.executemany("INSERT INTO news VALUES (?,?)", items)

cmd = '''SELECT COUNT(*) FROM news WHERE Term="Obama"'''
print("Obama size: ", conn.execute(cmd).fetchone())

cmd = '''
SELECT A.Article_id, A.Term, B.Term FROM news AS A 
JOIN news AS B ON A.Article_id = B.Article_id 
AND A.Term != B.Term '''

for result in conn.execute(cmd).fetchall():
    print(result)

This gives:

Obama size:  (3,)
(411, 'Obama', 'Ghandi')
(21, 'Obama', 'Mandela')
(411, 'Ghandi', 'Obama')
(21, 'Mandela', 'Obama')
Hooked
  • Yes. This would make it faster at the cost of disk space while 'restructuring the data'. – richie Jul 25 '14 at 12:59
    @richie the way you are storing it now does not provide any of the advantages of a relational database. Since you can't run any queries against it, you'd be just as well off storing the data in its own text file. The "restructuring" not only gives speed, but flexibility for new queries. – Hooked Jul 25 '14 at 13:09

You could find a workaround with pandas.

1) Create a pandas DataFrame with pandas.read_sql

2) Then you can get a cross-join like the one suggested by @logc

3) After that you can convert the lists to sets and apply an intersection.

If you need help implementing it, I'll help you later; I'm in a bit of a hurry now.

EDIT:

OK, it is very simple indeed, but I don't know if it has the performance you need; maybe you will need to read the csv file step by step:

import pandas, sqlite3

# databaseFilePath and outputTableName are placeholders for your own values
conn = sqlite3.connect(databaseFilePath)
df = pandas.read_sql('SELECT * FROM Terms;', conn)

# Parse each blob into a Python list, then into a set for fast intersection
df['Article Ids'] = df['Article Ids'].apply(eval).apply(set)

# Dummy constant key: merging on it gives the cross-join of df with itself
df['key'] = False
df2 = pandas.merge(df, df, on='key')
df2 = df2[df2.Term_x != df2.Term_y]  # drop self-pairs

df2['Common Articles IDS'] = df2.apply(
    lambda row: set.intersection(row['Article Ids_x'], row['Article Ids_y']),
    axis=1)
df2['No of Common Articles Ids'] = df2['Common Articles IDS'].apply(len)
df2['Common Articles IDS'] = df2['Common Articles IDS'].apply(list).apply(str)

df2[['Term_x', 'Term_y', 'No of Common Articles Ids', 'Common Articles IDS']].to_sql(outputTableName, conn)
caiohamamura
  • When I try `df2['Common Articles IDS'] = df2.apply(lambda row:set.intersection(row['Article Ids_x'], row['Article Ids_y']), axis=1)` I get the error `ValueError: Wrong number of items passed 1, indices imply 9` – richie Jul 25 '14 at 11:16
  • Do you have any term that has null Article_ids? If so, how is it referenced: '[]', ''? – caiohamamura Jul 25 '14 at 13:37
  • No I don't have null Article_ids. But there would be null intersections. Could that be a reason? – richie Jul 25 '14 at 16:26
  • No, I've tested with the data you provided, replacing ... and it works fine. Perhaps you could supply some more data for me to test. – caiohamamura Jul 26 '14 at 14:57
  • Have you done it? Maybe some of the rows are not returning sets; you can check for those. Try this before `df['key'] = False`: `df[df['Article Ids'].apply(type) != type(set())]` to see if it returns any line that is not a set. – caiohamamura Jul 28 '14 at 16:45