
My program needs a function that reads data from a CSV file ("all.csv") and extracts every row that contains 'Virginia', then writes the extracted rows to another CSV file named "Virginia.csv". The program runs without error; however, when I open "Virginia.csv", it is blank. My guess is that the issue is with my nested for loop, but I am not entirely sure what is causing it.

Here is the data within the all.csv file:
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

Here is my code:

import csv

input_file = 'all.csv'
output_file = 'Virginia.csv'
state = 'Virginia'
mylist = []

def extract_records_for_state (input_file, output_file, state):
    with open(input_file, 'r') as infile:
        contents = infile.readlines()
        
        with open(output_file, 'w') as outfile:
            writer = csv.writer(outfile)
        
            for row in range(len(contents)):
                contents[row] = contents[row].split(',') #split elements
            
            for row in range(len(contents)):
                for word in range(len(contents[row])):
                
                    if contents[row][2] == state:
                        writer.writerow(row)
                
                
extract_records_for_state(input_file,output_file,state)

Geraldo
  • I suggest stepping through your code line-by-line with a debugger; it should reveal the actual issue pretty fast. It's a really handy skill to be able to debug your own code; in fact I'd say it's an essential day 1 skill. PyCharm is free and has an excellent debugger: https://www.jetbrains.com/help/pycharm/debugging-your-first-python-application.html One thought I have though is that if the state names are in the last column of a row, the newline character will be included, since you're using `readlines()`. See: https://stackoverflow.com/questions/12330522/how-to-read-a-file-without-newlines – Random Davis Nov 11 '21 at 16:59
  • It should raise an error at `writer.writerow(row)` because "row" is an integer. `contents[row]` is the row. – tdelaney Nov 11 '21 at 17:05
  • Strange. I am using Jupyter Notebook, but the debugger doesn't give me any errors. – Geraldo Nov 11 '21 at 17:19
  • It seems like you don't have the right data. You'd only get the error if you see "Virginia" in the row. – tdelaney Nov 11 '21 at 20:35
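Random Davis's `readlines()` point from the comments can be shown in isolation; a minimal sketch (the sample line follows the linked dataset's column layout):

```python
# Each line from readlines() keeps its trailing newline, so after a
# split(',') the LAST field carries "\n" and never compares equal to
# a bare value. Middle columns, like the state here, are unaffected.
line = "2020-03-09,Fairfax,Virginia,51059,4,0\n"  # as readlines() returns it
fields = line.split(',')

assert fields[-1] == "0\n"      # newline stuck to the final field
assert fields[-1] != "0"        # an equality test on it silently fails
assert fields[2] == "Virginia"  # the state column still matches
```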

3 Answers


I ran your code and it gave me an error:

Traceback (most recent call last):
  File "c:\Users\Dolimight\Desktop\Stack Overflow\Geraldo\main.py", line 27, in
    extract_records_for_state(input_file, output_file, state)
  File "c:\Users\Dolimight\Desktop\Stack Overflow\Geraldo\main.py", line 24, in extract_records_for_state
    writer.writerow(row)
_csv.Error: iterable expected, not int

I fixed the error by passing the row's contents, `contents[row]`, to the `writerow()` function and ran it again; the data showed up in Virginia.csv. It produced duplicates, so I also removed the inner word for-loop.

import csv

input_file = 'all.csv'
output_file = 'Virginia.csv'
state = 'Virginia'
mylist = []


def extract_records_for_state(input_file, output_file, state):
    with open(input_file, 'r') as infile:
        contents = infile.readlines()

        with open(output_file, 'w') as outfile:
            writer = csv.writer(outfile)

            for row in range(len(contents)):
                contents[row] = contents[row].split(',')  # split elements

            print(contents)

            for row in range(len(contents)):
                if contents[row][2] == state:
                    writer.writerow(contents[row]) # this is what I changed


extract_records_for_state(input_file, output_file, state)
Dolimight
  • I tried the change you made, but I still get an empty file. – Geraldo Nov 11 '21 at 17:51
  • @Geraldo Jupyter Notebook must have admin privileges [see this](https://stackoverflow.com/questions/43248141/how-can-i-run-jupyter-notebook-in-admin-privilege/56514286) – Dolimight Nov 11 '21 at 18:12
  • I don't think admin privileges are the problem, and granting them is generally a dangerous thing to do. If you had a rights problem, it would result in an exception, not just an empty file. – tdelaney Nov 11 '21 at 20:36

You have two errors. The first is that you write the row index at `writer.writerow(row)` when the row itself is `contents[row]`. The second is that `readlines()` leaves the newline on the final column of each row, and you don't strip it on write. Instead, you could leverage the csv module more fully: let the reader parse the rows, and rather than reading everything into a list, which uses a fair amount of memory, filter and write row by row.

import csv

input_file = 'all.csv'
output_file = 'Virginia.csv'
state = 'Virginia'
mylist = []

def extract_records_for_state (input_file, output_file, state):
    with open(input_file, 'r', newline='') as infile, \
            open(output_file, 'w', newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        # add header
        writer.writerow(next(reader))
        # filter for state
        writer.writerows(row for row in reader if row[2] == state)

extract_records_for_state(input_file,output_file,state)
tdelaney
  • I tried this code you provided, but it only writes the header to the file. The rest of the data is missing. Is this something wrong on my end of things, rather than the code? I also tried @Dolimight 's fix, but again it still gives me a blank csv file. I am not sure why it works for him and not me. – Geraldo Nov 11 '21 at 18:21
  • I get 78651 rows in Virginia.csv. Must be something else. Your "all.csv" may not be the same thing we downloaded. You could experiment. Instead of writing the file with `writerows`, you could do something like `import collections`, and then ` counts = collections.Counter(row[2] for row in reader)`. Then print it `for k,v in counts.items(): print(k,v)` which would show you all of the unique values for row[2] and how often they are seen. There should be 70k for Virginia. – tdelaney Nov 11 '21 at 20:34
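The `Counter` diagnostic tdelaney sketches in that comment, made self-contained here with an inline sample standing in for `all.csv` (the real file would be opened and passed to `csv.reader` the same way):

```python
import collections
import csv
import io

# Inline sample with the same column layout as all.csv
sample = """date,county,state,fips,cases,deaths
2020-03-09,Fairfax,Virginia,51059,4,0
2020-03-09,Chelan,Washington,53007,1,1
2020-03-09,Clark,Washington,53011,1,0
"""

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
# Tally every distinct value in column index 2 (the state column);
# a misaligned column or an unexpected value shows up immediately here
counts = collections.Counter(row[2] for row in reader)

for value, count in counts.items():
    print(value, count)
```

If "Virginia" never appears in this output, the file being read is not the dataset you expect.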

Looking at your code, two things jump out at me:

  1. I see a bunch of nested statements (logic)
  2. I see you reading a CSV as plain text, then interpreting it as CSV yourself (`contents[row] = contents[row].split(',')`).
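To see why point 2 matters in general, consider a hypothetical quoted field (this particular dataset happens to be quote-free): a manual `split(',')` breaks on it, while `csv.reader` parses it correctly:

```python
import csv
import io

# Hypothetical row with a comma inside a quoted field;
# us-counties.csv doesn't use quoting, but CSV in general can
line = '2020-03-09,"Doe, John County",Virginia,51059,4,0'

# Manual split cuts the quoted field in two and shifts every
# later column, so index 2 is no longer the state
assert line.split(',')[2] != 'Virginia'

# csv.reader understands the quoting rules
row = next(csv.reader(io.StringIO(line)))
assert row[1] == 'Doe, John County'
assert row[2] == 'Virginia'
```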

I recommend two things:

  1. break up logic into distinct chunks: all that nesting can be hard to interpret and debug; do one thing, prove that works; do another thing, prove that works; etc...
  2. use the CSV API to its fullest: use it to both read and write your CSVs

I don't want to try and replicate/fix your code, instead I'm offering this general approach to achieve those two goals:

import csv

# Read in
all_rows = []
with open('all.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header (I didn't see you keep it)

    for row in reader:
        all_rows.append(row)

# Process
filtered_rows = []
for row in all_rows:
    if row[2] == 'Virginia':
        filtered_rows.append(row)

# Write out
with open('filtered.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(filtered_rows)

Once you understand both the logic and the API of those discrete steps, you can move on (advance) to composing something more complex, like the following which reads a row, decides if it should be written, and if so, writes it:

import csv

with open('filtered.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)

    with open('all.csv', 'r', newline='') as f_in:
        reader = csv.reader(f_in)
        next(reader) # discard header

        for row in reader:
            if row[2] == 'Virginia':
                writer.writerow(row)

Using either of those two pieces of code on this (really scaled-down) sample of all.csv:

date,county,state,fips,cases,deaths
2020-03-09,Fairfax,Virginia,51059,4,0
2020-03-09,Virginia Beach city,Virginia,51810,1,0
2020-03-09,Chelan,Washington,53007,1,1
2020-03-09,Clark,Washington,53011,1,0

gets me a filtered.csv that looks like:

2020-03-09,Fairfax,Virginia,51059,4,0
2020-03-09,Virginia Beach city,Virginia,51810,1,0

Given the size of this dataset, the second approach of write-on-demand-inside-the-read-loop is both faster (about 5x faster on my machine) and uses significantly less memory (about 40x less on my machine) because there's no intermediate storage with all_rows.

But, please take the time to run both, read them carefully, and see how each works the way it does.

Zach Young