
I've scraped a bunch of words from the dictionary and created a massive CSV file with all of them, one word per row.

I have another function, which reads from that massive CSV file, and then creates smaller CSV files.

The function is supposed to create CSV files with only 500 words/rows each, but something is amiss: the first file has 501 words/rows, and the rest of the files have 502 words/rows.

Man, maybe I'm tired, but I can't seem to spot what exactly is causing this in my code. Or is there nothing wrong with my code at all?

Below is the part of the function that I assume is causing the problem. The full function can be seen below that.

Suspect Part of Function

def create_csv_files():
  limit = 500
  count = 0
  filecount = 1
  zfill = 3
  filename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format('1'.zfill(zfill))
  with open('C:\\Users\\Anthony\\Desktop\\Scrape\\Results\\dictionary.csv') as readfile:
    csvReader = csv.reader(readfile)
    for row in csvReader:
      term = row[0]
      if ' ' in term:
        term = term.replace(' ', '')
      if count <= limit:
        count += 1
      else:
        count = 0
        filecount += 1
        filename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format(str(filecount).zfill(zfill))
      aw = 'a' if os.path.exists(filename) else 'w'
      with open(filename, aw, newline='') as writefile:
        fieldnames = [ 'term' ]
        writer = csv.DictWriter(writefile, fieldnames=fieldnames)
        writer.writerow({
          'term': term
        })

The Whole Function

import csv
import os

def create_csv_files():
  limit = 500
  count = 0
  filecount = 1
  zfill = 3
  idiomsfilename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\idioms.csv'
  filename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format('1'.zfill(zfill))
  with open('C:\\Users\\Anthony\\Desktop\\Scrape\\Results\\dictionary.csv') as readfile:
    csvReader = csv.reader(readfile)
    for row in csvReader:
      term = row[0]
      if 'idiom' in row[0] and row[0] != ' idiom':
        term = row[0][:-5]
        aw = 'a' if os.path.exists(idiomsfilename) else 'w'
        with open(idiomsfilename, aw, newline='') as idiomsfile:
          idiomsfieldnames = ['idiom']
          idiomswriter = csv.DictWriter(idiomsfile, fieldnames=idiomsfieldnames)
          idiomswriter.writerow({
            'idiom':term
          })
        continue
      else:
        if ' ' in term:
          term = term.replace(' ', '')
        if count <= limit:
          count += 1
        else:
          count = 0
          filecount += 1
          filename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format(str(filecount).zfill(zfill))
        aw = 'a' if os.path.exists(filename) else 'w'
        with open(filename, aw, newline='') as writefile:
          fieldnames = [ 'term' ]
          writer = csv.DictWriter(writefile, fieldnames=fieldnames)
          writer.writerow({
            'term': term
          })
      print(term)
oldboy
  • What exactly is amiss? – absolutelydevastated Jul 11 '19 at 06:06
  • @absolutelydevastated oh wth, it deleted one of my paragraphs! The first file has 501 words/rows; the rest of the files have 502 words/rows. – oldboy Jul 11 '19 at 06:20
  • Why don't you use pandas for such manipulations? It's much simpler and easier to understand. Refer to [this](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe). – Vishnudev Krishnadas Jul 11 '19 at 06:25
  • @Vishnudev no idea what pandas is – oldboy Jul 11 '19 at 06:38
  • Pandas is a library used for data manipulation and analysis; it handles even large datasets quickly and comprehensively. It provides data structures to hold the data. For example, a `DataFrame` (a two-dimensional data structure with rows and columns) can hold the `csv` data, and `pandas.read_csv` reads CSV data in. – Vishnudev Krishnadas Jul 11 '19 at 06:42
  • @Vishnudev interesting ill check it out one of these days – oldboy Jul 13 '19 at 21:07

1 Answer


The files end up with an unexpected number of rows because of the order of operations around your if-else condition.

You increment `count` whenever it is less than or equal to `limit`, and a row is written on every iteration regardless. On the very first iteration you increment `count` to 1 and then write the first term, and so on. Because you use `<=` rather than a strict inequality, the condition is still true at `count = 500`, so you increment once more and write a 501st word before the reset ever happens.

From the second file onward, the first word is written immediately after the reset, while `count` is still 0. The reset doesn't trigger again until `count` reaches 501, so each subsequent file ends up with 502 words.
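If it helps to see it, here is a quick standalone simulation of just the counter logic (the 2,000-row total is made up for illustration):

limit = 500
count = 0
filecount = 1
rows_per_file = {}
for _ in range(2000):  # pretend the dictionary has 2000 rows
  # Same if-else as the original: increment first, reset only afterwards
  if count <= limit:
    count += 1
  else:
    count = 0
    filecount += 1
  # A row is "written" on every iteration, whichever branch ran
  rows_per_file[filecount] = rows_per_file.get(filecount, 0) + 1
print(rows_per_file)  # {1: 501, 2: 502, 3: 502, 4: 495}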

To fix this, check for `count >= limit` and start a new file when that's the case, and increment `count` after you write to the CSV file rather than before. That should help.

import csv
import os

def create_csv_files():
  limit = 500
  count = 0
  filecount = 1
  zfill = 3
  filename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format('1'.zfill(zfill))
  with open('C:\\Users\\Anthony\\Desktop\\Scrape\\Results\\dictionary.csv') as readfile:
    csvReader = csv.reader(readfile)
    for row in csvReader:
      term = row[0]
      if ' ' in term:
        term = term.replace(' ', '')
      # Reset the counter and start a new file once the limit is reached
      if count >= limit:
        count = 0
        filecount += 1
        filename = 'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format(str(filecount).zfill(zfill))
      aw = 'a' if os.path.exists(filename) else 'w'
      with open(filename, aw, newline='') as writefile:
        fieldnames = [ 'term' ]
        writer = csv.DictWriter(writefile, fieldnames=fieldnames)
        writer.writerow({
          'term': term
        })
        count += 1 # Increment only after the row has been written
absolutelydevastated
  • oh right, the order was the problem: the counter gets reset to 0 and then incremented again, so the second word is written while the counter is only at 1. incrementing after the write and then changing the condition should do the trick – oldboy Jul 11 '19 at 06:41
  • this can also be fixed by resetting to `1` instead of `0`, i believe – oldboy Jul 11 '19 at 06:43
  • @BugWhisperer I wanted to suggest that too, but I have a bit of compulsiveness in that I like my counters to start at the same number. But yes, that will work as well. You'll still need to make the else clause a strict inequality though. – absolutelydevastated Jul 11 '19 at 06:48
  • @BugWhisperer Also, I think you should look at the Pandas suggestion. It'll make your function much shorter. You can slice the dataframe to the actual number of rows you want. – absolutelydevastated Jul 11 '19 at 06:52
  • i want all the rows, so that's not a factor. out of curiosity, which part of my code would it simplify? – oldboy Jul 13 '19 at 21:09
  • @BugWhisperer You'll do something like `all_words = pd.read_csv(file_path)`, get the length of words and loop over in steps of 500 using `for i in range(0, all_words.shape[0], 500)`, slice out every 500 words using the `iloc` indexer like `my_slice = all_words.iloc[i * 500: (i + 1) * 500]` and write to your output `my_slice.to_csv(filename.format(i))`. That should reduce everything to just 4~5 lines. – absolutelydevastated Jul 14 '19 at 04:17
  • it would reduce the whole function to 4 or 5 lines?!?! – oldboy Jul 16 '19 at 23:39
  • @BugWhisperer Um, yes. You don't actually have to do the logic of incrementing counters, checking whether the current file exists, and so on. – absolutelydevastated Jul 17 '19 at 01:50
  • that would be amazing. ill def have to look into it now – oldboy Jul 17 '19 at 20:43
  • @BugWhisperer My suggested code is buggy, but that's the idea. Plus it doesn't involve any complex dataframe manipulation so it should be straightforward to resolve. – absolutelydevastated Jul 18 '19 at 02:56
  • cool. thanks for bringing that to my attention. hopefully i dont forget about this when i go to modify those scripts – oldboy Jul 19 '19 at 02:08
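
For reference, here is a minimal sketch of the pandas approach described in that last exchange, with the slice arithmetic corrected (since `i` already steps by 500, each slice should be `i:i + 500` rather than `i * 500:(i + 1) * 500`). It assumes pandas is installed and that `dictionary.csv` has a single column and no header row; the paths and filename pattern are carried over from the question:

import pandas as pd

# Read the one-column dictionary CSV; header=None because the file has no header row
all_words = pd.read_csv('C:\\Users\\Anthony\\Desktop\\Scrape\\Results\\dictionary.csv', header=None)

# range() already steps by 500, so i is a row offset and each chunk is rows i through i + 500
for filecount, i in enumerate(range(0, len(all_words), 500), start=1):
  chunk = all_words.iloc[i:i + 500]
  chunk.to_csv(
    'C:\\Users\\Anthony\\Desktop\\Scrape\\Dictionary\\terms{}.csv'.format(str(filecount).zfill(3)),
    index=False, header=False,
  )

This replaces the counter, the file-exists check, and the append-vs-write logic entirely: each chunk is written exactly once, so every output file gets exactly 500 rows (except possibly the last).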