
[Using Python 3.3] I have one huge CSV file that contains about 46 million rows and includes a couple of columns. I want to read that file, add a couple of calculated columns, and spit out a couple of 'segmented' CSV files. I've tried the following code on a smaller test file, and it does exactly what I want it to do. But now I'm loading the original CSV file (which is about 3.2 GB) and I get a memory error. Is there a more memory-efficient way of writing the code below?

Please note that I'm very new to Python, so there are probably lots of things I'm not aware of.

Example input data:

email               cc  nr_of_transactions  last_transaction_date   timebucket  total_basket
email1@email.com    us  2                   datetime value          1           20.29
email2@email.com    gb  3                   datetime value          2           50.84
email3@email.com    ca  5                   datetime value          3           119.12
...                 ... ...                 ...                     ...         ...

This is my code:

import csv
import scipy.stats as stats
import itertools
from operator import itemgetter


def add_rankperc(filename):
    '''
    Function that calculates percentile rank of total basket value of a user (i.e. email) within a country. Next, it assigns the user to a rankbucket based on its percentile rank, using the following rules:
     Percentage rank between 75 and 100 -> top25
     Percentage rank between 25 and 74  -> mid50
     Percentage rank between 0 and 24   -> bottom25
    '''

    # Defining headers for ease of use/DictReader
    headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date', 'timebucket', 'total_basket']
    groups = []

    with open(filename, encoding='utf-8', mode='r') as f_in:
        # Input file is tab-separated, hence dialect='excel-tab'
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=headers)
        # DictReader reads every value as a string, so convert total_basket to a float
        dict_list = []
        for row in r:
            row['total_basket'] = float(row['total_basket'])
            # Append row to a list (of dictionaries) for further processing
            dict_list.append(row)

    # Groupby function on cc and total_basket
    for key, group in itertools.groupby(sorted(dict_list, key=itemgetter('cc', 'total_basket')), key=itemgetter('cc')):
        rows = list(group)
        for row in rows:
            # Calculates the percentile rank for each value for each country
            row['rankperc'] = stats.percentileofscore([r['total_basket'] for r in rows], row['total_basket'])
            # Percentage rank between 75 and 100 -> top25
            if 75 <= row['rankperc'] <= 100:
                row['rankbucket'] = 'top25'
            # Percentage rank between 25 and 74 -> mid50
            elif 25 <= row['rankperc'] < 75:
                row['rankbucket'] = 'mid50'
            # Percentage rank between 0 and 24 -> bottom25
            else:
                row['rankbucket'] = 'bottom25'
            # Appending all rows to a list to be able to return it and use it in another function
            groups.append(row)
    return groups


def filter_n_write(data):
    '''
    Function takes input data, groups by specified keys and outputs only the e-mail addresses to csv files as per the respective grouping.
    '''

    # Creating group iterator based on keys
    for key, group in itertools.groupby(sorted(data, key=itemgetter('timebucket', 'rankbucket')), key=itemgetter('timebucket', 'rankbucket')):
        # Collect the email addresses for this combination of grouping keys
        emails = [row['email'] for row in group]
        # Dynamically naming output file based on grouping keys
        f_out = 'output-{}-{}.csv'.format(key[0], key[1])
        with open(f_out, encoding='utf-8', mode='w') as fout:
            w = csv.writer(fout, dialect='excel', lineterminator='\n')
            # Write one address per row; wrapping each email in a list keeps
            # the full address in a single cell
            w.writerows([email] for email in emails)

filter_n_write(add_rankperc('infile.tsv'))

Thanks in advance!

Matthijs
  • "I have one huge CSV file that contains about 46 million rows and include a couple of columns" .... Why? This is about the least efficient method of storing data... You should switch your method of data storage instead of try to make a CSV work for you... Why not try some SQL? (Or anything else that actually uses databases, or storage methods *meant* for storing large amounts of data - unlike a csv file) – Inbar Rose Jul 04 '13 at 15:05
  • Because this CSV is the export out of a database system. The reason I'm writing a Python script is the 'grouping' and writing output to multiple CSV files. You're right that I can do this in the database system, however it would require me to download every single list of email addresses, which could be up to 180 CSV files. So instead I thought about writing a script to do that work for me. Does that make a bit more sense? – Matthijs Jul 04 '13 at 15:47
  • Why not interact with the database directly with Python? And then just extract exactly what you need, and create the output/result files you want in the most efficient manner.. – Inbar Rose Jul 04 '13 at 15:48
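For context, a minimal sketch of what that suggestion could look like. Everything here is hypothetical - it assumes the data lives in a SQLite table named transactions that already carries timebucket and rankbucket columns:

import csv
import sqlite3

# Hypothetical database and table names - adjust to your actual setup
conn = sqlite3.connect('transactions.db')

# One output file of e-mail addresses per (timebucket, rankbucket) pair
pairs = conn.execute(
    "SELECT DISTINCT timebucket, rankbucket FROM transactions").fetchall()
for timebucket, rankbucket in pairs:
    rows = conn.execute(
        "SELECT email FROM transactions"
        " WHERE timebucket = ? AND rankbucket = ?",
        (timebucket, rankbucket)).fetchall()
    out_name = 'output-{}-{}.csv'.format(timebucket, rankbucket)
    with open(out_name, mode='w', newline='') as f:
        # fetchall() returns 1-tuples; writerows puts one address per row
        csv.writer(f).writerows(rows)

conn.close()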

2 Answers


The pandas library (http://pandas.pydata.org/) has very nice and fast CSV reading capabilities (http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table). As an added bonus, you will have your data in numpy arrays, which makes it very easy to calculate percentiles. This question discusses reading a large CSV in chunks with pandas.
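For illustration, a minimal sketch of that approach. The separator and column names come from the question's sample data, and rank(pct=True) is a close (though not byte-identical) substitute for scipy's percentileofscore:

import pandas as pd

headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
           'timebucket', 'total_basket']

# Reading only the columns the calculation needs keeps memory use down
df = pd.read_csv('infile.tsv', sep='\t', names=headers,
                 usecols=['email', 'cc', 'timebucket', 'total_basket'])

# Percentile rank of total_basket within each country, as a 0-100 score
df['rankperc'] = df.groupby('cc')['total_basket'].rank(pct=True) * 100

# Bucket the scores with the same boundaries as in the question
df['rankbucket'] = pd.cut(df['rankperc'], bins=[0, 25, 75, 100],
                          labels=['bottom25', 'mid50', 'top25'],
                          include_lowest=True)

# One file of e-mail addresses per (timebucket, rankbucket) combination
for (tb, rb), group in df.groupby(['timebucket', 'rankbucket']):
    if len(group):
        group[['email']].to_csv('output-{}-{}.csv'.format(tb, rb),
                                index=False, header=False)

If even the reduced columns are too big for memory, read_csv's chunksize argument returns the file in pieces instead of one DataFrame.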

Robert Franke
  • Hi Robert, I have heard about pandas but have no experience with it. Will look into it, hopefully it will make some sense. At the same time, any other help on the current problem would be more than appreciated. – Matthijs Jul 04 '13 at 18:28

I agree with Inbar Rose that it would be better to use database functions to attack this problem. Let's say we need to answer the question as you asked it, though - I think we can, at the expense of speed.

You're probably running out of memory in constructing the list of all the rows' dictionaries. We can work around this by only considering a subset of the rows at a time.

Here's my code for the first step - roughly your add_rankperc function:

import csv
from scipy.stats import percentileofscore

# Placeholder paths - adjust these to your actual file names
input_path = 'infile.tsv'
output_path = 'output_ranked.csv'

# Run through the whole file once, saving each row to a file corresponding to
# its 'cc' column
cc_dict = {}
with open(input_path, encoding="utf-8", mode='r', newline='') as infile:
  csv_reader = csv.reader(infile, dialect="excel-tab")
  for row in csv_reader:
    cc = row[1]
    if cc not in cc_dict:
      intermediate_path = "intermediate_cc_{}.txt".format(cc)
      outfile = open(intermediate_path, mode='w', newline='')
      csv_writer = csv.writer(outfile)
      cc_dict[cc] = (intermediate_path, outfile, csv_writer)
    _ = cc_dict[cc][2].writerow(row)

# Close the output files
for cc in cc_dict.keys():
  cc_dict[cc][1].close()

# Run through the whole file once for each 'cc' value
for cc in cc_dict.keys():
  intermediate_path = cc_dict[cc][0]
  with open(intermediate_path, mode='r', newline='') as infile:
    csv_reader = csv.reader(infile)
    # Pick out all of the rows with the 'cc' value under consideration
    group = [row for row in csv_reader if row[1] == cc]
    # Get the 'total_basket' values for the group
    A_scores = [float(row[5]) for row in group]
    for row in group:
      # Compute this row's 'total_basket' score based on the rest of the
      # group's
      p = percentileofscore(A_scores, float(row[5]))
      row.append(p)
      # Categorize the score using the question's bucket boundaries
      bucket = ("bottom25" if p < 25 else ("mid50" if p < 75 else "top25"))
      row.append(bucket)
  # Append the augmented rows to the combined output file
  with open(output_path, mode='a', newline='') as outfile:
    csv_writer = csv.writer(outfile)
    csv_writer.writerows(group)

46 million rows is a lot, so this will probably be slow. I avoided using the DictReader functionality of the csv module and just indexed the rows directly to avoid that overhead. I also computed the first argument to percentileofscore once for each group instead of for every row in the group.

If this works, then I think you can follow the same idea for the filter_n_write function: run through the generated output file once, picking out the distinct (timebucket, rankbucket) pairs. Then go through that file again, once for each pair.
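For illustration, a minimal sketch of that second step. The column positions assume the augmented rows keep the original layout, with the percentile score and bucket appended as columns 6 and 7, and output_path names the combined file written above:

import csv

output_path = 'output_ranked.csv'  # the combined file from the first step

# Pass 1: collect the distinct (timebucket, rankbucket) pairs
pairs = set()
with open(output_path, mode='r', newline='') as infile:
  for row in csv.reader(infile):
    pairs.add((row[4], row[7]))

# Pass 2: one run through the file per pair, keeping only the emails
for timebucket, rankbucket in pairs:
  out_name = 'output-{}-{}.csv'.format(timebucket, rankbucket)
  with open(out_name, mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    with open(output_path, mode='r', newline='') as infile:
      for row in csv.reader(infile):
        if (row[4], row[7]) == (timebucket, rankbucket):
          # One e-mail address per output row
          writer.writerow([row[0]])

This keeps almost nothing in memory, at the cost of re-reading the combined file once per pair.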

bbayles
  • Hi Bo, your code is working and is solving the MemoryError problem. However now I dug up another one, namely that the script takes ages to complete - it's been running for hours as of now. Looking into other solutions. – Matthijs Jul 05 '13 at 15:16
  • I edited my solution to avoid spinning through all 46M rows for each 'cc' value and instead use a bunch of intermediate files. Does that help? – bbayles Jul 05 '13 at 17:35