
I have the following CSV file (each line has a dynamic number of characters, but the columns are fixed... hope I am making sense):

   **001**     Math        **02/20/2013**  A

   **001**     Literature  **03/02/2013**  B

   **002**     Biology     **01/01/2013**  A

   **003**     Biology     **04/08/2013**  A

   **001**     Biology     **05/01/2013**  B

   **002**     Math        **03/10/2013**  C

I am trying to get the results into another CSV file in the following format, grouped by student ID and ordered by date in ascending order:

   001,#Math;A;02/20/2013#Biology;B;05/01/2013#Literature;B;03/02/2013

   002,#Biology;A;01/01/2013#Math;C;03/10/2013

   003,#Biology;A;04/08/2013

There is one constraint though: the input file is huge, around 200 million rows. I tried using C#, storing the data in a database, and writing a SQL query, but it is very slow and not acceptable. After googling I hear Python is very powerful for these operations. I am new to Python and have started playing with the code. I would really appreciate the Python gurus helping me get the results I described above.

Think
  • Does every line in the file use a fixed number of characters? Because it may help to provide a faster solution. – Markon May 17 '13 at 21:51
  • @Markon hmmmm I don't think so – Think May 17 '13 at 21:51
  • Ok. So basically each line has a dynamic number of characters. You should re-edit your question, because it seems the CSV has a fixed structure. – Markon May 17 '13 at 21:52
  • Edited: (each line has a dynamic number of characters but the columns are fixed... hope I am making sense) – Think May 17 '13 at 21:56
  • I think your DB approach is the correct one... Use a DB's bulk import to take the data, create a view over it returning it in a suitable key order, then use itertools.groupby to aggregate the results (a sketch of that groupby step follows these comments). Alternatively, look at a database with MapReduce and get that to do the heavy lifting – Jon Clements May 17 '13 at 22:13
  • Also, if you have 200 million rows, why isn't this stuff in a database already? You normally only end up with that many rows from an export... – Jon Clements May 17 '13 at 22:16
  • Btw, what are those crazy data formats?? – Jakub M. May 17 '13 at 22:19
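
A minimal sketch of the groupby approach from the comment above, assuming the rows have already been exported from the database sorted by student ID into a file named sorted_records.txt (the file name is an assumption; the '**'-wrapped field format follows the sample above):

    from itertools import groupby

    def parse(line):
        # '**001**  Math  **02/20/2013**  A' -> ('001', 'Math', '02/20/2013', 'A')
        return tuple(t.strip('*') for t in line.split())

    with open('sorted_records.txt') as src, open('report.csv', 'w') as dst:
        rows = (parse(line) for line in src if line.strip())
        # groupby yields one contiguous run per student ID -- this only works
        # because the input is already sorted by that key
        for stud, group in groupby(rows, key=lambda r: r[0]):
            full_grades = ('{1};{3};{2}'.format(*rec) for rec in group)
            dst.write('{0},#{1}\n'.format(stud, '#'.join(full_grades)))

Because groupby streams one student at a time, only a single student's records need to be held in memory at once.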

1 Answer

content='''
   **001**     Math        **02/20/2013**  A

   **001**     Literature  **03/02/2013**  B

   **002**     Biology     **01/01/2013**  A

   **003**     Biology     **04/08/2013**  A

   **001**     Biology     **05/01/2013**  B

   **002**     Math        **03/10/2013**  C
'''

from collections import defaultdict

lines = content.split("\n")
# Generator: split each non-empty line into its whitespace-separated fields
items_iter = (line.split() for line in lines if line.strip())

# Map each student ID to the list of (class, grade, date) tuples seen for it
aggregated = defaultdict(list)

for items in items_iter:
    # strip the surrounding '**' from the ID and date fields
    stud, class_, date, grade = (t.strip('*') for t in items)
    aggregated[stud].append((class_, grade, date))

# Python 2 code: on Python 3 use .items() and the print() function
for stud, data in aggregated.iteritems():
    # each record becomes 'Class;Grade;Date'; records are joined with '#'
    full_grades = [';'.join(items) for items in data]
    print '{},#{}'.format(stud, '#'.join(full_grades))

Output:

003,#Biology;A;04/08/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
001,#Math;A;02/20/2013#Literature;B;03/02/2013#Biology;B;05/01/2013

Of course, this is ugly, hackish code, just to show you how it can be done in Python. When working with large streams of data, use generators and iterators; don't use file.readlines(), just iterate over the file object. Iterators will not read all the data at once, but chunk by chunk as you iterate over them, and not earlier.
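
For example, a minimal sketch of the same aggregation reading straight from a file instead of the in-memory string (the file name student_records.txt is an assumption):

    from collections import defaultdict

    aggregated = defaultdict(list)

    # Iterating over the file object reads one line at a time, instead of
    # pulling all 200 million rows into memory the way file.readlines() would
    with open('student_records.txt') as f:
        for line in f:
            if not line.strip():
                continue
            stud, class_, date, grade = (t.strip('*') for t in line.split())
            aggregated[stud].append((class_, grade, date))

Note that the aggregated dict itself still grows with the number of distinct students, which is what the bucketing below addresses.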

If you are concerned that 200M records will not fit in memory, then do the following:

  1. sort the records into separate "buckets" (like in bucket sort) by student ID

    cat all_records.txt | grep 001 > stud_001.txt  # do it for the other students as well

  2. do the processing per bucket

  3. merge

grep is just an example; make a fancier script (awk, or Python again) that filters by student ID: for example, select all records with ID < 1000, then 1000 <= ID < 2000, and so on. You can do this safely because the records for different students are disjoint.
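
A minimal sketch of that bucketing step in Python, assuming numeric student IDs, a bucket of 1000 IDs per file, and the file names shown (all illustrative choices):

    # Partition the big input into per-range bucket files by student ID,
    # so each bucket can be aggregated independently and the results merged
    BUCKET_SIZE = 1000  # IDs per bucket; tune to what fits in memory

    handles = {}
    try:
        with open('all_records.txt') as src:
            for line in src:
                if not line.strip():
                    continue
                stud = line.split()[0].strip('*')
                bucket = int(stud) // BUCKET_SIZE
                if bucket not in handles:
                    handles[bucket] = open('bucket_%05d.txt' % bucket, 'w')
                handles[bucket].write(line)
    finally:
        for f in handles.values():
            f.close()

Each bucket_*.txt can then be processed with the aggregation code above, and the per-bucket outputs simply concatenated, since no student ID appears in more than one bucket.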

Jakub M.
  • If 200 million records fits in memory... (or at least the non-duplicated parts anyway).... – Jon Clements May 17 '13 at 22:04
  • Jakub M.: 200 million * 1 byte ~ 200MB. So, if your line has 10 bytes, you need ~2GB of memory. The problem here is not the number of iterations, but "where" you store the lines. – Markon May 17 '13 at 22:07
  • @Markon even given 1 byte - factor in Python's object overhead, and it gets even smaller quicker... – Jon Clements May 17 '13 at 22:08
  • @Jakub M Thank you very much. I will try your example and see how it behaves. Eventually this script will be running on a production windows environment. +1 for a quick solution. – Think May 17 '13 at 22:17
  • Still, @JonClements' suggestion about the database is the right way to go, if you are not doing some single-shot, throw-away script – Jakub M. May 17 '13 at 22:20
  • @Jakub M I just realized the results are not ordered by date. Any idea how to achieve that? – Think May 18 '13 at 19:29
  • use `sorted` and specify the `key` parameter (see the sketch after these comments) – Jakub M. May 18 '13 at 22:03
  • @JakubM I did make some progress by converting your script to read from a file, but the date sort is still not working. Any help? with open('student.txt','r') as data1: accounts = defaultdict(list) for line in data1: sid, subject, dateReg, rank = line.split()[:4] accounts[sid].append((subject, rank, dateReg)) print accounts sortedData = sorted(accounts.iteritems(), key=operator.itemgetter(2)) for key, value in sortedData: print key print(value) for sid, data1 in sortedData: report = [';'.join(items) for items in data1] print '{},#{}'.format(sid, '#'.join(report)) – Think May 21 '13 at 16:28
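
Following up on that comment thread, a minimal sketch of the per-student date sort, reusing the aggregated dict from the answer above (parsing the MM/DD/YYYY dates with datetime is an assumption; sorting the raw strings would not order correctly across years):

    from datetime import datetime

    # Sort each student's (class, grade, date) tuples by the parsed date
    # before joining, so every report line comes out in ascending date order
    for stud in sorted(aggregated):
        records = sorted(aggregated[stud],
                         key=lambda r: datetime.strptime(r[2], '%m/%d/%Y'))
        full_grades = [';'.join(items) for items in records]
        print('{},#{}'.format(stud, '#'.join(full_grades)))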