
I have a huge CSV where each line has a user ID. I want to find the UserID that turns up most frequently across the whole set.

I've used `DictReader` from the `csv` module to read in the CSV, as I think this will be the most useful way to manipulate individual records later.

I can't find a way to search through the values stored under the key `UserID` in the resulting dictionaries and find the most frequent value.

Any help would be greatly appreciated!

Here is my code so far:

  import csv
  from time import time

  start = time()
  myData = open("myCSV.csv", "rb")
  csvReader = csv.DictReader(myData)

  # declare counters
  volume = 0

  # sum all data usage
  for data in csvReader:
      volume += float(data["volume_total"])
  print "Total : %f" % volume

  # calculate processing time
  elapsed = time() - start
  print "Processing time: %f seconds" % elapsed
SK10
  • How are you storing the objects from `DictReader`? If you're storing them in a list for instance, have you tried printing the first element of that list? – Jon Clements Nov 24 '12 at 17:40
  • Hi @JonClements. I'm new to Python, so I'm unsure what you mean. As I understand it, the objects from `DictReader` are stored as (key, value) pairs, and not as a list...? Have I misunderstood? – SK10 Nov 24 '12 at 17:46

2 Answers


Create a dictionary, say `usersOccurences`, whose keys are user IDs and whose values are occurrence counts. Iterate with `DictReader` over all the records in the CSV file.

For each `UserID` you encounter, look up its occurrence count in the `usersOccurences` dictionary and increment it; if the `UserID` is not yet in the dictionary, add it with a starting count of 1.

After the iteration, scan `usersOccurences` for the `UserID` with the maximal count, as in the sketch below.
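A minimal sketch of that approach (assuming, as in the question, a header row with a `UserID` column; the filename is the question's):

  import csv

  usersOccurences = {}
  myData = open("myCSV.csv", "rb")
  for row in csv.DictReader(myData):
      uid = row["UserID"]
      # get() returns 0 the first time a user ID is seen
      usersOccurences[uid] = usersOccurences.get(uid, 0) + 1
  myData.close()

  # the key with the largest count
  famousUser = max(usersOccurences, key=usersOccurences.get)
  print "Most frequent UserID: %s (%d occurrences)" % (famousUser, usersOccurences[famousUser])

Each dictionary lookup here is constant time on average, so the whole pass is linear in the number of rows; there is no need to loop through the dictionary for every record.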

Mikhail Tsaplin
  • I'll try this. I am concerned that this will be very slow since I have a huge dataset with hundreds of thousands of different UIDs and looping through the whole usersOccurrences dictionary each time will be inefficient. Thank you for your response! – SK10 Nov 24 '12 at 17:55
  • You may need some caching capability if you have one CSV file and new rows are appended to it. – Mikhail Tsaplin Nov 24 '12 at 18:15

Alternatively, you can use NumPy's `loadtxt`:

  from numpy import loadtxt

  # loadtxt splits on whitespace by default, so pass delimiter="," for a CSV
  a, b = loadtxt("filename.csv", usecols=(0, 1), unpack=True, delimiter=",")

Then you can create a dict using a dict comprehension (it's faster!):

  a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5]  # e.g. the user-ID column as a plain list
  d = {x: a.count(x) for x in a}
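(Dict comprehensions require Python 2.7 or later; on older versions, `dict((x, a.count(x)) for x in a)` is equivalent.)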

Refer to this question.

To find the most frequent user-id:

  # in Python 2, keys() and values() return lists in matching order
  user_id_list = d.keys()
  frequency = d.values()
  index = frequency.index(max(frequency))
  famous_user = user_id_list[index]
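Putting the pieces together, a rough sketch under the assumptions that the user IDs are numeric, sit in the first column, and that the file has one header row to skip (`loadtxt` returns NumPy arrays, so convert to a plain list before counting):

  from numpy import loadtxt

  # assumed layout: numeric user IDs in column 0, one header row
  ids = loadtxt("myCSV.csv", usecols=(0,), delimiter=",", skiprows=1)
  id_list = [int(x) for x in ids]

  d = {x: id_list.count(x) for x in id_list}
  famous_user = max(d, key=d.get)
  print "Most frequent UserID: %d" % famous_user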
Thiru
  • Thanks for your answer. I hadn't heard of the NumPy package and am trying it out. However, when using the line `a,b = loadtxt("mycsv.csv", usecols=(0,2), unpack=True)` I'm getting an `IndexError: list index out of range`, but my data definitely has at least 3 columns, and there are no gaps in the first and third columns. Do you know why I might be getting this error? – SK10 Nov 25 '12 at 21:09
  • Ah, I've fixed it: the second column (time) was formatted YYYY-MM-DD HH:MM:SS, and `loadtxt` was treating the whitespace between date and time as a delimiter. Fixed with a `delimiter=','`. Thank you! – SK10 Nov 25 '12 at 21:21