
So I have this huge CSV. I made a test script to count the rows, and it returned about 24 million. I want to count how many rows share the same CIK number and write that data to a separate CSV.

The desired output in the other file would be:

CIK number: number of IPs with that CIK number.

I had some ideas, but they weren't efficient enough, so the script was useless because it took ages to go through the CSV. Has anyone come across a similar problem?

Should I use Pandas for this? Any suggestion would be a huge help!

Example of the CSV:

(screenshot of the CSV)

Stefan

3 Answers


How about using a Counter?

import collections, csv

# One pass over the file; only the counter (one entry per distinct CIK) stays in memory
with open("big.csv", newline="") as csvfile:
    counter = collections.Counter(row["cik"] for row in csv.DictReader(csvfile))

# newline="" stops the csv writer from emitting blank rows on Windows
with open("out.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(("cik", "count"))
    for cik, count in counter.items():
        writer.writerow((cik, count))

Memory overhead for this will depend on how many distinct CIKs there are, not how many rows. I don't know how many that is, so I can't say whether it will be an issue.
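
If DictReader turns out to be the bottleneck (it builds a dict for every row), a plain csv.reader indexed by column position avoids that per-row overhead. A minimal sketch, assuming the first row is a header that contains a "cik" column:

import collections, csv

with open("big.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)              # assumes the first row is a header
    cik_col = header.index("cik")      # assumes a column literally named "cik"
    counter = collections.Counter(row[cik_col] for row in reader)

The write-out loop above works unchanged; only the way the counter is filled differs.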

brunns
  • This is a good answer. You can also easily write the results to a csv file line-by-line, which should be quick too. – m13op22 Apr 22 '19 at 02:52
  • @brunns Well, I already tried collections.Counter but had some problems with the code, probably my bad implementation of it. It works, but it takes about 9 minutes to complete for date and cik. I don't know whether Pandas would do it any quicker? – Stefan Apr 23 '19 at 17:33
  • Sorry, I'm not a pandas user. Maybe it could, but I can't help you with that. I just gave you the best solution I could think of in pure python. – brunns Apr 23 '19 at 20:06
  • @HS-nebula, since, between us, you are the most familiar with Pandas, can you answer my earlier question? – Stefan Apr 23 '19 at 20:30
  • @Stefan I mean, the way to do it in Pandas is shown in my answer. You could probably read in the file using the `csv` module into a list of lists and then convert that to a dataframe. That might make loading the file faster. I'm not sure the `groupby` will be faster than this answer though. – m13op22 Apr 23 '19 at 20:34

You could use pandas to group by the CIK column, then use size() to get the number of rows for each CIK value.

For example,

import pandas as pd

df = pd.read_csv('name.csv')
ndf = df.groupby('CIK').size()                  # number of rows per CIK
ndf.to_csv("CIK_number.csv", header=['count'])  # header names the count column

I think I may have misunderstood your desired output. If you want each (CIK, IP) pair together with the number of times that pair appears, do

df = pd.DataFrame({'CIK': ['102', '102', '103', '103', '104'],
                   'IP': ['103.92.134', '103.92.134', '103.92.135', '103.92.136', '105.32.134'],
                   'C1': [1, 2, 3, 4, 5],
                   'C2': [1, 0, 0, 1, 0]})

# size() counts rows per (CIK, IP) pair; reset_index(name='count') names the
# count column directly (selecting ['CIK'] before size() would make reset_index
# fail with "cannot insert CIK, already exists")
ndf = df.groupby(['CIK', 'IP']).size().reset_index(name='count')

which returns:

    CIK          IP  count
0  102  103.92.134      2
1  103  103.92.135      1
2  103  103.92.136      1
3  104  105.32.134      1

To read in a large CSV file in pandas, read it in chunks and concatenate them:

chunks = []
for chunk in pd.read_csv('csv_name.csv', chunksize=100):  # adjust chunksize to taste (see comments below)
    chunks.append(chunk)
df = pd.concat(chunks)
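
If even the concatenated dataframe is too big for memory, a sketch of an alternative (my own, assuming the column is literally named CIK and that a chunk size of one million is reasonable to start from) is to aggregate each chunk as it arrives and keep only the running counts:

counts = None
for chunk in pd.read_csv('csv_name.csv', usecols=['CIK'], chunksize=1_000_000):
    part = chunk.groupby('CIK').size()       # counts within this chunk
    counts = part if counts is None else counts.add(part, fill_value=0)

# fill_value promotes the counts to float, so cast back before writing
counts.astype(int).to_csv('CIK_number.csv', header=['count'])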
m13op22
  • I'm not that familiar with pandas, but I'll try out your solution – Stefan Apr 21 '19 at 17:58
  • @Stefan I may have misunderstood your desired output. Check if my edit is what you want. – m13op22 Apr 21 '19 at 18:05
  • To read in a large CSV in `pandas` you can use the `chunksize` parameter, then concatenate the smaller dataframes. – m13op22 Apr 21 '19 at 18:11
  • I tried out your earlier solution @HS-nebula, but my script crashed; maybe it can't handle such a large file? – Stefan Apr 21 '19 at 18:14
  • And the question about the output: CIK number: number of IP with that CIK number – Stefan Apr 21 '19 at 18:14
  • This last suggestion seems good, but how do I feed the csv file into it? – Stefan Apr 21 '19 at 18:16
  • @Stefan see my edit for how to read in the file. To save it, you can do what I suggested in the first part. – m13op22 Apr 21 '19 at 18:21
  • I think this will work, but my fear is it will take ages to complete, and that is something I wanted to avoid – Stefan Apr 21 '19 at 18:46
  • @Stefan You can of course adjust the chunksize and see which takes the least amount of time. Alternatively, you can read in the file line-by-line as suggested in another answer, convert to a dataframe, and use my suggestions. – m13op22 Apr 21 '19 at 18:48
  • Links like these may help you: [https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) @Stefan. However, I do think that using Pandas will be easiest for getting the count. You could also read in only the `CIK` and `IP` columns (`pd.read_csv(filename, usecols=['CIK','IP'])`), do the groupby, then save the results. (That way you're not copying and reading the entire dataframe.) – m13op22 Apr 21 '19 at 18:56

You can use pandas, but reading the whole file into memory at once is not very efficient:

import pandas
df = pandas.read_csv('my.csv')

But the quick and dirty way is to simply stream the file, pull out matching lines with simple string matching, and then write your new file:

with open("my.csv") as infile:
    for line in infile:
        if "CIK_number" in line:          # substring match on the CIK you're after
            do_something_with(line)       # placeholder for your own handling
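
To tie this streaming idea to the asked-for output (a count per CIK rather than pulling lines for one CIK), the same loop can feed a plain dict. A sketch, assuming CIK is the first comma-separated field and the first line is a header:

counts = {}
with open("my.csv") as infile:
    next(infile)                        # skip the header row (assumption)
    for line in infile:
        cik = line.split(",", 1)[0]     # assumes CIK is the first column
        counts[cik] = counts.get(cik, 0) + 1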
  • Python's csv reader is fine too, but if you are just re-making the exact file, maybe you don't need it? I guess I am assuming you don't want the big object in memory, so you want a simple stream. – Aaron Ruddick Apr 21 '19 at 18:05
  • You didn't understand my question; I don't want to enter a specific CIK. I want the script to extract how many IPs have the same CIK for every IP in the csv, but with the size of the csv, that is a problem... – Stefan Apr 21 '19 at 18:19
  • If you stream each line, the size doesn't matter. It will not load the full file in memory. – Aaron Ruddick Apr 21 '19 at 18:22