0

I have a sqlite table containing a column of filenames. Some filenames are duplicates of other files, so I'd like to iterate through each row, search the column for similar entries, and print those results to the console.

A print(row[0]) indicates that the first half of my findDupes loop works, iterating through each row. Things get weird when I make another sqlite statement to find similar entries and print the output. Instead of continuing with the loop, the loop only prints the first entry.

I'm not a SQL expert, so there's no telling what I'm doing wrong. Any help would be greatly appreciated. Thanks!

def getFiles():
    dirs = os.listdir(path)
    for files in dirs:
        c.execute('INSERT INTO myTable(files) VALUES(?)', (files,))

def findDupes():
    row = c.execute('select files from myTable order by files')
    while True:
        row = c.fetchone()
        if row == None:
            break
        c.execute('select files from myTable where files like ?',(row[0]+'%',))
        dupe = c.fetchone()
        print (dupe[0])
Stígandr
  • 2,874
  • 21
  • 36
  • You should use a new cursor for your second select. – John Sheridan Sep 30 '14 at 19:32
  • To be clear, looking at the query, you will not only be selecting duplicates but anything that has that file name as a prefix. So if you have both `my_name` and `my_name1` in your table, when you call the query `select files from myTable where files like 'my_name%'` you will return `my_name1`. I am not sure what the goal of your project is, but you should be aware of this lest it give you more problems. – JB333 Sep 30 '14 at 20:11

4 Answers4

2

First, your code doesn't reveal what c is - is it a connection object or a cursor? (either can be used in this object, but cursors are generally preferable) And why is it global?

Assuming it is a cursor object, then what is happening is during your first time through the loop, the second call to c.execute resets the query, so the second time c.fetchone is called, sqlite is looking in the results of the select files from myTable where files like ?

One way to solve this is to use multiple cursors; one for iterating over the file names and one to perform the duplicate lookup.

def findDupes(conn): #pass in your database connection object here
    file_curs = conn.cursor()
    file_curs.execute('select files from myTable order by files')
    while True:
        row = file_curs.fetchone()
        if row == None:
            break
        dup_curs = conn.cursor()
        dup_curs.execute('select files from myTable where files like ?',(row[0]+'%',))
        dupe = dup_curs.fetchone()
        print (dupe[0])

Note that you can perform the de-duplication entirely in SQL (see for example Deleting duplicate rows from sqlite database), but if you're new to SQL, you may want to stick with the above.

Community
  • 1
  • 1
Gretchen
  • 2,274
  • 17
  • 16
  • I assumed `c` was cursor but it's clear from his question and your answer, it was the connection object instead. +1 – aneroid Sep 30 '14 at 20:03
1

Your problem is that within your loop you're calling row = c.fetchone() each time, which will return one row from the most recent execution of a query on c. On the second loop this will be the results of c.execute('select files from myTable where files like ?',(row[0]+'%',)), with one row already fetched (so you're really setting row to the second result of c.execute('select files from myTable where files like ?',(row[0]+'%',)) with your current code, which I assume is probably returning None and breaking your loop).

Try this:

def findDupes():
    c.execute('select files from myTable order by files')
    rows = c.fetchall()
    for row in rows:
        c.execute('select files from myTable where files like ?',(row[0]+'%',))
        dupe = c.fetchone()
        print (dupe[0])
JB333
  • 244
  • 1
  • 7
1

You will need to change how you're using c. The first time your loop in findDupes() runs, it's fetching one row of files list. After that, you've got c.execute() for the like-ness select executing in the same iteration of the loop. When the 2nd iteration happens, the first c.fetchone() is fetching a row from your like-ness query, not the original all-files-by-order query which you had outside the loop.

Use a different variable or cursor for the two queries.

aneroid
  • 12,983
  • 3
  • 36
  • 66
0

You can solve this problem another way, by having the database server do the counting for you. That way you just run one query, instead of the inefficient way by fetching all the files, then checking each one by one:

def find_dupes():
   rows = c.execute('SELECT files, COUNT(*) FROM myTable GROUP BY files HAVING COUNT(*) > 1')
   return [row[0] for row in rows]

dupes = find_dupes()
print('\n'.join(dupes))
Burhan Khalid
  • 169,990
  • 18
  • 245
  • 284
  • Based on the actual code, the results will actually be different from finding exact duplicates, it will be any files that have another file in the table as a prefix (see my comment on the main question above). But for pure duplicate finding I agree your solution is great! – JB333 Sep 30 '14 at 20:15