
So I have this huge CSV. I made a test script to count the rows, and it returned about 24 million. I want to count how many rows share the same CIK number and write that data to a separate CSV.

The desired output in the other file would be:

CIK number: number of IPs with that CIK number.

I had some ideas, but they weren't efficient enough, so the script was useless because it took ages to go through the CSV. Has anyone come across a similar problem?

Should I use Pandas for this? Any suggestion would be a huge help!

Example of the CSV:

(screenshot of the CSV)

Stefan

3 Answers


How about using a Counter?

import collections, csv

# One pass over the file; only the counter (one entry per distinct CIK) stays in memory
with open("big.csv", newline="") as csvfile:
    counter = collections.Counter(row["cik"] for row in csv.DictReader(csvfile))

# newline="" stops the csv writer from emitting blank rows on Windows
with open("out.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(("cik", "count"))
    for cik, count in counter.items():
        writer.writerow((cik, count))

Memory overhead for this will depend on how many distinct CIKs there are, not how many rows. I don't know how many that is, so I can't say whether it will be an issue.
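
If DictReader turns out to be the bottleneck (it builds a dict for every row), a plain csv.reader indexed by column position avoids that per-row overhead. A minimal sketch, assuming the first row is a header that contains a "cik" column:

import collections, csv

with open("big.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)              # assumes the first row is a header
    cik_col = header.index("cik")      # assumes a column literally named "cik"
    counter = collections.Counter(row[cik_col] for row in reader)

The write-out loop above works unchanged; only the way the counter is filled differs.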

brunns
  • This is a good answer. You can also easily write the results to a csv file line-by-line, which should be quick too. – m13op22 Apr 22 '19 at 02:52
  • @brunns Well, I already tried collections.Counter but had some problems with the code, probably my bad implementation of it. It works, but it takes about 9 minutes to complete for date and cik. I don't know whether Pandas would do it any quicker? – Stefan Apr 23 '19 at 17:33
  • Sorry, I'm not a pandas user. Maybe it could, but I can't help you with that. I just gave you the best solution I could think of in pure python. – brunns Apr 23 '19 at 20:06
  • @HS-nebula, since, between us, you are the most familiar with Pandas, can you answer my earlier question? – Stefan Apr 23 '19 at 20:30
  • @Stefan I mean, the way to do it in Pandas is shown in my answer. You could probably read in the file using the `csv` module into a list of lists and then convert that to a dataframe. That might make loading the file faster. I'm not sure the `groupby` will be faster than this answer though. – m13op22 Apr 23 '19 at 20:34

You could use pandas to group by the CIK column, then use size() to get the number of rows for each CIK value.

For example,

import pandas as pd

df = pd.read_csv('name.csv')
ndf = df.groupby('CIK').size()                  # number of rows per CIK
ndf.to_csv("CIK_number.csv", header=['count'])  # header names the count column

I think I may have misunderstood your desired output. If you want each (CIK, IP) pair together with the number of times that pair appears, do

df = pd.DataFrame({'CIK': ['102', '102', '103', '103', '104'],
                   'IP': ['103.92.134', '103.92.134', '103.92.135', '103.92.136', '105.32.134'],
                   'C1': [1, 2, 3, 4, 5],
                   'C2': [1, 0, 0, 1, 0]})

# size() counts rows per (CIK, IP) pair; reset_index(name='count') names the
# count column directly (selecting ['CIK'] before size() would make reset_index
# fail with "cannot insert CIK, already exists")
ndf = df.groupby(['CIK', 'IP']).size().reset_index(name='count')

which returns:

    CIK          IP  count
0  102  103.92.134      2
1  103  103.92.135      1
2  103  103.92.136      1
3  104  105.32.134      1

To read in a large CSV file in pandas, read it in chunks and concatenate them:

chunks = []
for chunk in pd.read_csv('csv_name.csv', chunksize=100):  # adjust chunksize to taste (see comments below)
    chunks.append(chunk)
df = pd.concat(chunks)
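
If even the concatenated dataframe is too big for memory, a sketch of an alternative (my own, assuming the column is literally named CIK and that a chunk size of one million is reasonable to start from) is to aggregate each chunk as it arrives and keep only the running counts:

counts = None
for chunk in pd.read_csv('csv_name.csv', usecols=['CIK'], chunksize=1_000_000):
    part = chunk.groupby('CIK').size()       # counts within this chunk
    counts = part if counts is None else counts.add(part, fill_value=0)

# fill_value promotes the counts to float, so cast back before writing
counts.astype(int).to_csv('CIK_number.csv', header=['count'])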
m13op22
  • I'm not that familiar with pandas, but I'll try out your solution – Stefan Apr 21 '19 at 17:58
  • @Stefan I may have misunderstood your desired output. Check if my edit is what you want. – m13op22 Apr 21 '19 at 18:05
  • To read in a large CSV in `pandas` you can use the `chunksize` parameter, then concatenate the smaller dataframes. – m13op22 Apr 21 '19 at 18:11
  • I tried out your earlier solution @HS-nebula, but my script crashed; maybe it can't handle such a large file? – Stefan Apr 21 '19 at 18:14
  • And the question about the output: CIK number: number of IP with that CIK number – Stefan Apr 21 '19 at 18:14
  • This last suggestion seems good, but how do I feed the csv file into it? – Stefan Apr 21 '19 at 18:16
  • @Stefan see my edit for how to read in the file. To save it, you can do what I suggested in the first part. – m13op22 Apr 21 '19 at 18:21
  • I think this will work, but my fear is it will take ages to complete, and that is something I wanted to avoid – Stefan Apr 21 '19 at 18:46
  • @Stefan You can of course adjust the chunksize and see which takes the least amount of time. Alternatively, you can read in the file line-by-line as suggested in another answer, convert to a dataframe, and use my suggestions. – m13op22 Apr 21 '19 at 18:48
  • Links like these may help you: [https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) @Stefan. However, I do think that using Pandas will be easiest for getting the count. You could also read in only the `CIK` and `IP` columns (`pd.read_csv(filename, usecols=['CIK','IP'])`), do the groupby, then save the results. (That way you're not copying and reading the entire dataframe.) – m13op22 Apr 21 '19 at 18:56

You can use pandas, but reading the whole file into memory at once is not very efficient:

import pandas
df = pandas.read_csv('my.csv')

But the quick and dirty way is to simply stream the file, pull out matching lines with simple string matching, and then write your new file:

with open("my.csv") as infile:
    for line in infile:
        if "CIK_number" in line:          # substring match on the CIK you're after
            do_something_with(line)       # placeholder for your own handling
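
To tie this streaming idea to the asked-for output (a count per CIK rather than pulling lines for one CIK), the same loop can feed a plain dict. A sketch, assuming CIK is the first comma-separated field and the first line is a header:

counts = {}
with open("my.csv") as infile:
    next(infile)                        # skip the header row (assumption)
    for line in infile:
        cik = line.split(",", 1)[0]     # assumes CIK is the first column
        counts[cik] = counts.get(cik, 0) + 1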
  • Python's csv reader is fine too, but if you are just re-making the exact file, maybe you don't need it? I guess I am assuming you don't want the big object in memory, so you want a simple stream. – Aaron Ruddick Apr 21 '19 at 18:05
  • You didn't understand my question; I don't want to enter a specific CIK. I want the script to extract how many IPs have the same CIK for every IP in the csv, but with the size of the csv, that is a problem... – Stefan Apr 21 '19 at 18:19
  • If you stream each line, the size doesn't matter. It will not load the full file in memory. – Aaron Ruddick Apr 21 '19 at 18:22