3

I have a very large gzipped CSV file with 59 million rows. I want to filter that file for rows matching certain criteria and put all of those rows in a new master CSV file. As of now, I broke the gzip file into 118 smaller CSV files and saved them on my computer. I did that with the following code:

import pandas as pd

num = 0
df = pd.read_csv('google-us-data.csv.gz', header=None,
                 compression='gzip', chunksize=500000,
                 names=['a','b','c','d','e','f','g','h','i','j','k','l','m'],
                 error_bad_lines=False, warn_bad_lines=False)

for chunk in df:
    num = num + 1
    chunk.to_csv('%ggoogle us' % num, sep='\t', encoding='utf-8')

The code above worked perfectly and I now have a folder with my 118 small files. I then wrote code to go through the 118 files one by one, extract rows that matched certain conditions, and append them all to a new csv file that I've created and named 'google final us'. Here is the code:

import pandas as pd
import numpy

for i in range(1, 118):
    file = open('google final us.csv', 'a')
    df = pd.read_csv('%ggoogle us' % i, error_bad_lines=False,
                     warn_bad_lines=False)
    df_f = df.loc[(df['a']==7) & (df['b']==2016) & (df['c']=='D') &
                  (df['d']=='US')]
    file.write(df_f)

Unfortunately, the code above is giving me the below error:

KeyError                                  Traceback (most recent call last)
C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in
get_loc(self, key, method, tolerance)
   1875             try:
-> 1876                 return self._engine.get_loc(key)
   1877             except KeyError:
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item  
(pandas\hashtable.c:12408)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item 
(pandas\hashtable.c:12359)()
KeyError: 'a'
During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
<ipython-input-9-0ace0da2fbc7> in <module>()
      3 file = open('google final us.csv','a')
      4 df = pd.read_csv('1google us')
----> 5 df_f = df.loc[(df['a']==7) & (df['b'] == 2016) & 
      (df['c'] =='D') & (df['d'] =='US')]
      6 file.write(df_f)
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in 
__getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in 
_getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\generic.py in 
_get_item_cache(self, item)
  1343         res = cache.get(item)
  1344         if res is None:
-> 1345             values = self._data.get(item)
  1346             res = self._box_item_values(item, values)
  1347             cache[item] = res
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\internals.py in 
get(self, item, fastpath)
   3223 
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items)) 
 [isnull(self.items)]
C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in 
get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return 
   self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, 
   tolerance=tolerance)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item 
(pandas\hashtable.c:12408)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item  
(pandas\hashtable.c:12359)()
KeyError: 'a'

Any ideas what's going wrong? I've read numerous other Stack Overflow posts (e.g. Create dataframes from unique value pairs by filtering across multiple columns or How can I break down a large csv file into small files based on common records by python), but I'm still not sure how to do this. Also, if you have a better way to extract the data than this method, please let me know!

SizzyNini
  • Why are you using `file.write()` and not pandas' [to_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html)? – Parfait Sep 29 '16 at 21:08

3 Answers

2
import pandas
import glob

path = '.'  # working directory that contains the "split files" folder
csvFiles = glob.glob(path + "/split files/*.csv")

list_ = []
for files in csvFiles:
    df = pandas.read_csv(files, index_col=None)
    df_f = df[(df['a']==7) & (df['b']==2016) & (df['c']=='D') & (df['d']=='US')]
    list_.append(df_f)

frame = pandas.concat(list_, ignore_index=True)
frame.to_csv("Filtered Appended File")

Keep all the files in a "split files" folder inside the working directory.

This should work: it reads all the required files from that directory, filters each one, and concatenates the results.

Reading an entire CSV takes a lot of memory, so splitting it up and working on the pieces is a sensible approach. You seem to be on the right track with that.

  • Did the job wonderfully - thanks!! Also, for anyone else looking through this - this post is really helpful as well - http://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe/36416258#36416258 – SizzyNini Oct 03 '16 at 17:22
0

When you use file.write(df_f) you are, at best, saving a string representation of the DataFrame, which is meant for humans to look at (write() expects a string, not a DataFrame). By default that representation truncates rows and columns so that large frames can be displayed on the screen in a sensible manner. As a result, column "a" may get chopped off.

with open('google final us.csv', 'a') as file:
    for i in range(1, 118):
        headers = i == 1              # write the header only for the first file
        ...                           # read and filter the chunk as before
        df_f.to_csv(file, header=headers)

I did not test the above snippet, but it should give you an idea of how to get going.

There are other issues with this code that you may want to correct:

  1. Open the file for writing before the loop and close it after; best to use a context manager, as in the snippet above.

  2. If the entire data fits in memory, why go through the trouble of splitting it into 118 files? Simply filter it and save the resulting DataFrame using the df.to_csv() method.

  3. Instead of pandas, consider using csv.DictReader and filtering the lines on the fly (see the sketch after this list).
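
Here is a minimal sketch of point 3, assuming the original gzipped file is comma-separated with no header row (field names supplied manually, as in the read_csv call in the question); handling of the malformed lines that error_bad_lines was skipping is left out:

import csv
import gzip

fieldnames = ['a','b','c','d','e','f','g','h','i','j','k','l','m']

with gzip.open('google-us-data.csv.gz', 'rt', newline='') as src, \
     open('google final us.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src, fieldnames=fieldnames)
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # csv gives back strings, so compare against string values
        if (row['a'] == '7' and row['b'] == '2016'
                and row['c'] == 'D' and row['d'] == 'US'):
            writer.writerow(row)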

Lastly, if this is a one-time job, why even write code for something you could accomplish with a grep command (on Unix-like systems)?

Victor Olex
  • I'm not able to load the original gzip file into memory - it's too big and python crashes. Thanks for the other tips - I'll give them a shot. I don't have a UNIX based system and have not had too much experience with it -I'll definitely explore it if I don't find an alternate solution. – SizzyNini Sep 29 '16 at 20:28
0

It may not even be necessary to split the giant CSV into multiple files. You can use the chunksize and iterator options of the read_csv function (see the docs for these slightly more advanced uses).

import pandas as pd

file_path = 'my_big_file.csv'
chunksize = 100000
df_chunks = []

for df in pd.read_csv(file_path, chunksize=chunksize, iterator=True):
    df_f = df.loc[(df['a']==7) & (df['b']==2016) & (df['c']=='D') &
                  (df['d']=='US')]
    df_chunks.append(df_f)

master_df = pd.concat(df_chunks)

There's also the low_memory option you can pass to read_csv:

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
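
For example, a quick sketch of those two alternatives (the column dtypes below are assumptions for illustration only; file_path is the same as in the snippet above):

import pandas as pd

file_path = 'my_big_file.csv'

# Either turn off the chunked type inference entirely...
df = pd.read_csv(file_path, low_memory=False)

# ...or declare the column types up front (illustrative dtypes only).
df = pd.read_csv(file_path, dtype={'a': 'int64', 'b': 'int64', 'c': str, 'd': str})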

For some more advanced operations over the whole dataset (e.g. if you need to group some of the data), it might also be worth loading the data into an SQL database and doing the operations there; pandas provides some support for reading from and writing to SQL. There's a nice example explained here.
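
A minimal sketch of that route, using the standard library's sqlite3 together with pandas' to_sql and read_sql_query (the database file and table names are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect('google_us.db')   # placeholder database file

# Load the big CSV into SQLite chunk by chunk, so it never has to fit in memory.
for chunk in pd.read_csv('my_big_file.csv', chunksize=100000):
    chunk.to_sql('google_us', conn, if_exists='append', index=False)

# Filter (or group, join, etc.) in SQL and pull only the result back into pandas.
master_df = pd.read_sql_query(
    "SELECT * FROM google_us WHERE a = 7 AND b = 2016 AND c = 'D' AND d = 'US'",
    conn)
conn.close()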

metakermit