Basic Statistics with csv files without using Pandas

Question

I am trying to run basic statisics in my program without the use of numpy and pandas. I want to calculate the overall average # of books borrowed. However, my total_loans len() does not return a count of the total # of books that were borrowed, thus causing my program to return 1 which is incorrect.

any advice would be appreciated!

My code below is supposed to return the overall average number of days a book was borrowed if it appears in both the bookloans.csv and books.csv based on the common book_id field (only books.csv has headers, bookloans.csv does not have any headers.

import csv

total_loan_days = 0
total_loans = 0

# Open the books file and read the title and authors into a list
with open('books.csv', 'r', encoding='utf-8-sig') as csv_books_file:
    books_reader = csv.reader(csv_books_file)
    next(books_reader)  # skip the header row
    for row in books_reader:
        books = row[0]


with open('bookloans.csv', 'r', encoding='utf-8-sig') as csv_bookloans_file:
    loans_reader = csv.reader(csv_bookloans_file)
    for rows in loans_reader:
        book_number = row[0]
        return_date = rows[3]
        start_date = rows[2]
        date_diff = int(return_date) - int(start_date)
        
        
        total_loan_days += date_diff
        total_loans = len(book_number)

       
#Calculate the overall average loan days for all books
    if date_diff > 1:
        overall_average_loan_days = total_loan_days / total_loans
    else:
        overall_average_loan_days = 1

    print(f"Overall Average Loan Days: {over`all_average_loan_days}")

Should `total_loans` be overwritten after each row? Seems unproductive. — B Remmelzwaal, Jul 16 '23 at 17:41
As `date_diff` is the difference in the last iteration of the for-loop (loan days of last row in the CSV), `if date_diff > 1:` doesn't make sense. — Michael Butscher, Jul 16 '23 at 17:51
`# Open the books file and read the title and authors into a list` I will note that isn't what your code does, but I'm not sure that matters, since I do t seen where you use `books` — juanpa.arrivillaga, Jul 16 '23 at 20:36

score 0 · Answer 1 · answered Jul 18 '23 at 00:12

I would start by reading the main list of books into a dict, keyed to a book's ID:

import csv

books: dict[str, list[str]] = {}

with open("books.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    assert header == ["ID", "Title", "Author"], f"{header} != {['ID', 'Title', 'Author']}"

    for row in reader:
        books[row[0]] = row

After reading this CSV:

ID,Title,Author
1,A Closed and Common Orbit,Becky Chambers
2,The Clan of the Cave Bear,Jean M. Auel
3,Sum,David Eagleman
4,the 13 1/2 lives of Captain Bluebear,Walter Moers

the dict books looks like:

{
    "1": ["1", "A Closed and Common Orbit", "Becky Chambers"],
    "2": ["2", "The Clan of the Cave Bear", "Jean M. Auel"],
    "3": ["3", "Sum", "David Eagleman"],
    "4": ["4", "the 13 1/2 lives of Captain Bluebear", "Walter Moers"],
}

Then I'd read the loans CSV into a separate dict, and parse the checked-out and returned timestamps into real date values (it will make the math and stats, later on, easier and more accurate):

from datetime import date, datetime, timedelta

DATE_FMT = r"%Y/%m/%d"

loans: dict[str, list[tuple[date, date]]] = {}

with open("book_loans.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)

    for row in reader:
        book_id = row[0]
        checked_out = datetime.strptime(row[1], DATE_FMT).date()
        returned = datetime.strptime(row[2], DATE_FMT).date()

        if book_id not in loans:
            loans[book_id] = []

        loans[book_id].append((checked_out, returned))

After reading this loans CSV:

1,2000/01/01,2000/01/13
2,2000/01/01,2000/01/10
3,2000/01/04,2000/01/16
2,2000/01/11,2000/02/01
3,2000/01/12,2000/01/13
1,2000/01/12,2000/01/26

The loans dict looks like:

{
    "1": [(date(2000, 1, 1), date(2000, 1, 13)), (date(2000, 1, 12), date(2000, 1, 26))],
    "2": [(date(2000, 1, 1), date(2000, 1, 10)), (date(2000, 1, 11), date(2000, 2, 1))],
    "3": [(date(2000, 1, 4), date(2000, 1, 16)), (date(2000, 1, 12), date(2000, 1, 13))],
}

a book ID that keys to a list of pairs of dates (checked-out and returned for each pair).

Note: book 4, Capt. Bluebear, was never loaned.

Now, with the complete list of all available books and a list of loan dates for each book (that was actually loaned), you can apply whatever logic/math you need.

I went for:

for book_id, row in books.items():
    title = row[1]
    author = row[2]

    if book_id not in loans:
        print(f"'{title}' by {author} was never loaned out. ")
        continue

    loan_dates = loans[book_id]
    first_checked_out = date.max
    last_returned = date.min
    loan_durations: list[timedelta] = []

    for checked_out, returned in loan_dates:
        if checked_out < first_checked_out:
            first_checked_out = checked_out
        if returned > last_returned:
            last_returned = returned
        loan_durations.append(returned - checked_out)

    # https://stackoverflow.com/a/3617540/246801, give sum() a starting value of timedelta(0)
    avg_loan = sum(loan_durations, timedelta(0)) / len(loan_durations)

    print(
        f"'{title}', by {author}, was loaned {len(loan_durations)} times.  It was first checked out on {first_checked_out} and last returned on {last_returned}.  The average loan was {avg_loan.days} days."
    )

which prints:

'A Closed and Common Orbit', by Becky Chambers, was loaned 2 times.  It was first checked out on 2000-01-01 and last returned on 2000-01-26.  The average loan was 13 days.
'The Clan of the Cave Bear', by Jean M. Auel, was loaned 2 times.  It was first checked out on 2000-01-01 and last returned on 2000-02-01.  The average loan was 15 days.
'Sum', by David Eagleman, was loaned 2 times.  It was first checked out on 2000-01-04 and last returned on 2000-01-16.  The average loan was 6 days.
'the 13 1/2 lives of Captain Bluebear' by Walter Moers was never loaned out.

Basic Statistics with csv files without using Pandas

1 Answers1