
I am using Pandas. I am trying to read in one column of names from a larger file. The file is 35 GB in total, which made my kernel die. So I would like to read in just one of the columns, and then "chunk" the data so the kernel doesn't die. From that, I need to get the count per name and find the name with the highest count. Here is what could be useful:

import pandas as pd

data = pd.read_csv("/Users/Desktop/EQR_Data/EQR_Transactions_1.csv", low_memory=False)

The column name I would like to import from my main file:

'seller_company_name'
  • Which column data do you want to sum if 'seller_company_name' is the only imported one? – SpghttCd Apr 25 '18 at 19:15
  • Please create a [mcve] – Robert Andrzejuk Apr 25 '18 at 19:32
  • All of the names in 'seller_company_name' are the ones I need to sum/count. There are 15 different names in the column, and each one has multiple entries. I need to sum those entries and find the name that came up the most. – SOCO Apr 25 '18 at 21:14

2 Answers


Sometimes you're better off just using the command line

If you have access to a Unix-like environment, this is what grep/sed/awk/cut were built for, as they work with streams.

See here for an example

An alternative would be to split your CSV on ',' and count the first field (note that cut numbers fields from 1, not 0, so adjust -f to whichever position seller_company_name is in):

cat some.csv | cut -d, -f1 | sort | uniq -c | sort -rn | head -1

The final sort -rn | head -1 picks out the name with the highest count.
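
If you'd rather stay in Python but keep the same streaming idea, a minimal sketch using the standard library's csv module and collections.Counter (assuming the header row contains 'seller_company_name') would be:

import csv
from collections import Counter

counts = Counter()
# Stream the file row by row instead of loading 35 GB at once
with open("/Users/Desktop/EQR_Data/EQR_Transactions_1.csv", newline='') as f:
    for row in csv.DictReader(f):
        counts[row['seller_company_name']] += 1

print(counts.most_common(1))  # [(most frequent name, its count)]

Unlike cut, the csv module also handles quoted fields that contain commas.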
Darren Brien

To read in just one column, use the keyword usecols:

data = pd.read_csv("/Users/Desktop/EQR_Data/EQR_Transactions_1.csv", usecols=['seller_company_name'])

Then you can groupby seller names:

grpd = data.groupby('seller_company_name')

grpd.groups is then a dict that maps each seller to the list of row indices where that seller occurs. Turn it into a dict of counts by taking the lengths of these lists:

result = {d: len(grpd.groups[d]) for d in grpd.groups}
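
From that dict, max(result, key=result.get) gives the most frequent seller. Since reading the full 35 GB file kills the kernel, you can also do the whole job in chunks; here is a sketch assuming read_csv's chunksize parameter, which yields the file as an iterator of DataFrames:

import pandas as pd

counts = None
# Read only the one column, a million rows at a time, so the
# 35 GB file never has to sit in memory all at once.
for chunk in pd.read_csv("/Users/Desktop/EQR_Data/EQR_Transactions_1.csv",
                         usecols=['seller_company_name'],
                         chunksize=10**6):
    c = chunk['seller_company_name'].value_counts()
    counts = c if counts is None else counts.add(c, fill_value=0)

print(counts.idxmax(), int(counts.max()))  # top seller and its count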
SpghttCd