
I am trying to get 100K random songs from the MusicBrainz database, which is saved locally. I'm new to programming and want to know the reason behind my computer slowing down (probably the RAM filling up). I'm running Ubuntu in a virtual machine. Please suggest some changes so I can go to sleep after running this.

import psycopg2
import random
import MySQLdb
from contextlib import closing

conn = psycopg2.connect("dbname='musicbrainz' user='musicbrainz' host='localhost' password='musicbrainz'")
conn1 = MySQLdb.connect(host="localhost", user="root", passwd="40OZlike", db="plalyst")
print("connections and cursors made...")
cur = conn1.cursor()
conn1.set_character_set('utf8')
cur.execute('SET NAMES utf8;')
cur.execute('SET CHARACTER SET utf8;')
cur.execute('SET character_set_connection=utf8;')
cur.close()

def migrateSongDB():
    try:
        cur1 = conn1.cursor()
        cur1.execute("select count(*) from Song")
        numberOfSongs = cur1.fetchall()[0][0]
        cur1.close()
        print("number of songs in our database is ")
        print(numberOfSongs)
        rnumbers = random.sample(range(1, 22660511), 100000-numberOfSongs)
        print("random numbers generated....")
        for eachnum in rnumbers:
            cur = conn.cursor()
            cur1 = conn1.cursor()
            print(eachnum)
            songName=""
            while(songName==""):
                cur.execute("""select name from track where id = %s """, (eachnum,))
                rows = cur.fetchall()
                print(rows)
                if not len(rows)==0:
                    songName = rows[0][0]
                eachnum+=1
            print("Got the track name:")
            print(songName)
            sql = 'INSERT into Song (name) values ( "'+songName+'")'
            print(sql)
            cur1.execute(sql)
            cur1.execute('commit')
            print("inserted into the song table....")
            cur.close()
            cur1.close()

        print("Songs Saved into new Data Base...")
        conn.close()
        conn1.close()
        print("Connections Closed")
    except:
        with conn1 as cursor:
            cursor.execute('select 1;')
            result = cursor.fetchall()
            for cur in result:
                print(cur)
        migrateSongDB()

def main():
    migrateSongDB()
    conn.close()
    conn1.close()

if __name__ == "__main__": main()

Thank you for taking the time to read this code. Also, if you have any suggestions for improving my coding style, I would love to learn. Thank you once again.

  • Identify the cause of the slowdown with the usual system tools like top. – Klaus D. Apr 06 '17 at 02:02
  • 1
    @KlausD.: `top` is going to be *worthless* here. – Makoto Apr 06 '17 at 02:11
  • @Makoto ...and you know that from? – Klaus D. Apr 06 '17 at 02:13
  • @KlausD.: I've profiled Python applications in the past. Using `top` to track down a performance issue with this is akin to debugging C with a pen. You're not going to get into the deep issues as to what's taking a while, and you're not going to be able to expose what's really going on with the underlying script or functions. – Makoto Apr 06 '17 at 02:15
  • @Makoto It is not even clear where the bottleneck is. Before profiling the Python process check the assumption that it is the cause of the problem. – Klaus D. Apr 06 '17 at 02:18
  • It's *definitely* the script, and my hunch is in the `while` loop. A profiler would be able to expose more, but that's at least worth a try. In a worst case scenario, *none* of the IDs between 1 and 22660511 which are pulled exist in the database, so that means 22660511 *or more* calls to a select statement. The ideal approach would be to simply move away from the IDs that don't exist. – Makoto Apr 06 '17 at 02:19
  • Thank you @Makoto. I just checked and saw that the count(*) is exactly the same as the last id... That is 22660511 – Krishna Chaitanya Nelloore Apr 06 '17 at 02:26

1 Answer


My hunch is that the IDs you pull out of a hat are eventually going to wander into territory in which the ID doesn't exist. This means that you're going to run a perpetually long loop, which is inefficient.

Instead of this approach, why not just pull the IDs out of the database which are known and select on that, instead?

You can accomplish that with the snippet below. I've borrowed a list flattening operation from this Stack Overflow answer, which will make working with the ID list a cinch.

cur = conn.cursor()
cur.execute("select id from track")
result = cur.fetchall()
# flatten the list of one-element tuples into a plain list of ids
result = [item for sublist in result for item in sublist]
# result now holds every id that actually exists in the track table;
# sample from it directly instead of guessing in range(1, 22660511)
rnumbers = random.sample(result, 100000 - numberOfSongs)

You can then get rid of the while loop, since you're guaranteed to have an ID that actually exists in your database.
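To see the pattern in isolation, here is a self-contained sketch (no database needed): the `rows` list stands in for a real `cur.fetchall()` result, since DB-API drivers return a list of one-element tuples. With the live connection you would run the `select id from track` query on the psycopg2 connection first.

```python
import random

# Simulated result of cur.fetchall() on "select id from track":
# drivers return a list of one-element tuples.
rows = [(1,), (3,), (7,), (15,), (22,), (40,), (99,), (120,)]

# Flatten the row tuples into a plain list of ids.
track_ids = [item for sublist in rows for item in sublist]

# Sample only ids that are known to exist: every pick is a hit,
# so the scan-forward while loop is no longer needed.
picks = random.sample(track_ids, 3)
```

Because `random.sample` draws without replacement from the known IDs, every selected value is guaranteed to match a row, and the `while` loop (and its worst-case millions of extra SELECTs) disappears.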

Makoto
  • I understand what you're saying, but the thing is all the ids are present; it's the name attribute that is empty in the MusicBrainz db. Is there any way I can check how much memory a variable or object is occupying while running this program? Forgive me if I'm wrong, I'm a newbie. Thank you. – Krishna Chaitanya Nelloore Apr 06 '17 at 18:26
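On the memory question in that last comment: the standard library can give a rough picture. `sys.getsizeof` reports the shallow size of a single object, and `tracemalloc` tracks overall Python-level allocations. A minimal sketch (the list here is a small stand-in for the script's big random sample, so the numbers are illustrative only):

```python
import sys
import tracemalloc

tracemalloc.start()

# Stand-in for the big random sample (smaller so it runs quickly).
rnumbers = list(range(100000))

# Shallow size of the list object itself, in bytes; the referenced
# int objects are counted separately by getsizeof.
list_size = sys.getsizeof(rnumbers)
print(list_size)

# Total Python-level allocations since tracemalloc.start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(current > 0 and peak >= current)  # -> True
```

For a per-object deep size (an object plus everything it references), a third-party tool such as Pympler's `asizeof` is the usual suggestion; the standard library only gives the shallow view shown above.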