
Suppose I am working with a large database containing details of the books in an entire library, and I want to compute various statistics about the collection. For example, in the code below I define a function that gets the top 10 most-represented authors.

import sqlite3
import pandas as pd

def most_owned_authors():
    db = 'database.db'
    conn = sqlite3.connect(db)
    cursor = conn.cursor()

    cursor.execute('''SELECT AUTHOR AS author FROM MAINTABLE WHERE OWNEDCHECKBOX = TRUE;''')

    authors_df = pd.DataFrame(cursor.fetchall())
    conn.close()

    authors = []
    author_dict = {}

    # flatten the single-column DataFrame into a plain list of author names
    for x in authors_df.iloc:
        authors.append(x.to_string(index=False))
    # count how many owned books each author has
    for x in authors:
        amount = authors.count(x)
        author_dict[x] = amount

    # sort authors by their counts, highest first
    author_dict = dict(sorted(author_dict.items(), key=lambda item: item[1], reverse=True))

    top_10_owned_authors = {}

    # keep only the first 10 entries of the sorted dict
    for x, k in enumerate(author_dict):
        if x == 10:
            break
        top_10_owned_authors[k] = author_dict[k]

    return top_10_owned_authors

Is there an easier way, using SQLite3 and Pandas, to generate statistics from an SQL query? Or do I have to write the logic manually as above? Can SQL SUM/COUNT across multiple instances of the same entry, so that I only need to pull rows from the resulting DataFrame? An additional example of the same idea would be to use SQL and Pandas to generate a DataFrame of books marked as "read" and the year they were read.


1 Answer


I think you can do it in two different ways:

1- If you have a lot of records in your database and you prefer not to fetch all of them before computing the statistics, it is better to push the work into the database by writing an SQL query that does the aggregation for you. Example:

import sqlite3
import pandas as pd

conn = sqlite3.connect('database.db')
top_10_df = pd.read_sql('''SELECT AUTHOR AS author, COUNT(*) AS counts FROM MAINTABLE WHERE OWNEDCHECKBOX = TRUE GROUP BY AUTHOR ORDER BY counts DESC LIMIT 10;''', conn)

This is a good solution as long as the SQL stays readable and not messy! But if the logic is complicated and the table is not big, it is better to write the logic in Python in a readable manner (it is easier to maintain).
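
The same pattern covers the second example in the question, a DataFrame of books marked as "read" together with the year they were read. A minimal sketch, assuming hypothetical column names TITLE, READCHECKBOX, and YEARREAD (adjust them to your actual schema):

# books flagged as read, plus the year they were read
# TITLE, READCHECKBOX and YEARREAD are assumed column names, not taken from the question
read_books_df = pd.read_sql('''SELECT TITLE AS title, YEARREAD AS year FROM MAINTABLE WHERE READCHECKBOX = TRUE;''', conn)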

2- If it is OK to load all the data in your code (when you have a small table and the query will not be executed very frequently):

Here you can use groupby, agg, and describe to compute statistics on the loaded DataFrame (more can be found here: Get statistics for each group (such as count, mean, etc) using pandas GroupBy?). See the example below:

# setup
import pandas as pd
import random  # just to generate some data
# generating a sample author list 
authors = ['A{}'.format(random.randint(1,idx)) for idx in range(1,100)]
authors_df = pd.DataFrame({'authors': authors}) # a dataframe of list of authors with duplicates - so, the task is to select top 10 

# group by - count how many records refer to an author
agg_author_df = authors_df.groupby('authors').size().reset_index(name='counts')

# sort and limit to 10
agg_author_df.sort_values(by='counts', ascending=False, inplace=True)
top_10_df = agg_author_df.head(10)
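
The snippet above only uses size(); for the agg and describe routes mentioned earlier, here is a minimal sketch assuming the frame also had a numeric column, say the year each book was read (the column and the values below are made up for illustration):

sample_df = pd.DataFrame({
    'authors': ['A1', 'A1', 'A2', 'A2', 'A2'],
    'year': [2019, 2020, 2020, 2021, 2022],
})

# several statistics per author in one pass
year_stats = sample_df.groupby('authors')['year'].agg(['count', 'min', 'max'])

# or a full numeric summary (count, mean, std, quartiles, ...) per author
year_summary = sample_df.groupby('authors')['year'].describe()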
Ehsan Hamzei
  • This is exactly what I was looking for! Putting the load into an SQL Query is very helpful. I wasn't aware that queries could be so powerful. Both solutions worked, so I'll add both to my wheelhouse for various situations! – Kyle Carroll Jun 25 '23 at 15:40