17

I have a 10 GB CSV file that contains some information that I need to use.

As I have limited memory on my PC, I cannot read the whole file into memory in a single batch. Instead, I would like to read only some rows of the file at a time, iteratively.

Say that at the first iteration I want to read the first 100 rows, at the second the rows from 101 to 200, and so on.

Is there an efficient way to perform this task in Python? Does pandas provide something useful for this? Or are there better methods (in terms of memory and speed)?

Ulderique Demoitre
  • Maybe: https://stackoverflow.com/questions/10717504/is-it-possible-to-use-read-csv-to-read-only-specific-lines – languitar Mar 20 '17 at 10:09
  • While it is possible to do this in many different ways in Python, sometimes it's more practical to just split the file (using e.g. `split -l 100 filename`) into smaller files before processing them with Python. – sphere Mar 20 '17 at 10:13

4 Answers

16

Here is the short answer.

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
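
To match the exact pattern from the question (100 rows per iteration), you can simply pass chunksize=100. Here is a minimal sketch, where the file name is a placeholder and the print is stand-in processing:

import pandas as pd

filename = 'your_file.csv'  # placeholder path

# Each chunk is an ordinary DataFrame holding 100 rows:
# rows 1-100 on the first iteration, 101-200 on the second, and so on.
for i, chunk in enumerate(pd.read_csv(filename, chunksize=100)):
    print(i, chunk.shape)  # replace with your own processing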

Here is the very long answer.

To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.

import pandas as pd
from sqlalchemy import create_engine

Next, set up a variable that points to your csv file. This isn’t necessary but it does help in re-usability.

file = '/path/to/csv/file'

With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.

print(pd.read_csv(file, nrows=5))

This uses pandas' read_csv function to read in only 5 rows (nrows=5) and then prints those rows to the screen. It lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.
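
If you only want one specific slice, for example rows 101 to 200 as in the question, a minimal sketch (not part of the original workflow) is to combine skiprows with nrows, keeping the header row so the column names are preserved:

# skip the first 100 data rows (but not the header), then read the next 100
df_slice = pd.read_csv(file, skiprows=range(1, 101), nrows=100)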

Before we can actually work with the data, we need to put it somewhere that lets us filter it and work with subsets. This is usually what I would use a pandas DataFrame for, but with large data files we need to store the data somewhere else. In this case, we'll set up a local SQLite database, read the csv file in chunks and then write those chunks to SQLite.

To do this, we’ll first need to create the sqllite database using the following command.

csv_database = create_engine('sqlite:///csv_database.db')

Next, we need to iterate through the CSV file in chunks and store the data into SQLite.

chunksize = 100000
i = 0   # counts the chunks processed
j = 1   # running offset so the row index keeps increasing across chunks
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # strip spaces from column names
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1

With this code, we are setting the chunksize to 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i = 0, j = 1) and then running a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk in the SQLite database (df.to_sql(…)).

This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas' SQL tools to pull data from the database without worrying about memory constraints.

To access the data now, you can run commands like the following:

df = pd.read_sql_query('SELECT * FROM "table"', csv_database)

Of course, using 'SELECT *…' will load all the data into memory, which is the problem we are trying to get away from, so you should add filters to your SELECT statements to narrow down the data. For example:

df = pd.read_sql_query('SELECT COL1, COL2 FROM "table" WHERE COL1 = SOMEVALUE', csv_database)
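
If even a filtered result is too big for memory, note that pd.read_sql_query also accepts a chunksize argument and then yields DataFrames one piece at a time instead of one big frame. A minimal sketch, where process is a placeholder for your own logic:

for chunk in pd.read_sql_query('SELECT COL1, COL2 FROM "table"', csv_database, chunksize=100000):
    process(chunk)  # placeholder for your own processing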
ASH
5

You can use pandas.read_csv() with the chunksize parameter:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # each chunk_df is a DataFrame containing a part of the whole CSV
    ...
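
As a small usage sketch (with a made-up column name some_column), you can filter each chunk and then combine the much smaller filtered pieces:

import pandas as pd

filtered_parts = []
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # keep only the rows you need from each chunk
    filtered_parts.append(chunk_df[chunk_df['some_column'] > 0])

result = pd.concat(filtered_parts, ignore_index=True)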
Guillaume
  • 1
    after reading the chunks we can concat them all to get our full CSV file in one data frame: df = pd.concat(chunk_dfs, ignore_index=True) – Debashis Sahoo Dec 14 '18 at 09:25
2

This code may help you with this task. It iterates through a large .csv file without consuming a lot of memory, so you can run it on a standard laptop.

import pandas as pd
import os

The chunksize here sets the number of rows of the csv file you want to read at a time:

chunksize2 = 2000

path = './'

# Read one chunk just to grab the column headers
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2

# Re-open the file; skiprows lets you resume from a later chunk if needed
start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2 * start_chunk)

for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)

        # You can now export each chunk to its own csv file
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(
            os.path.join(
                path, file_name
            )
        )
        print('saving ...')
        df2.to_csv(save_path, index=False)

    except Exception:
        print('reached the end')
        break
Leonardo
0

The method of transferring a huge CSV into a database is good because then we can easily use SQL queries. But we also have to take two things into account.

FIRST POINT: SQL is not made of rubber either; it will not stretch your memory.

For example, I converted this dataset to a db file:

https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

For this db file, the SQL query:

pd.read_sql_query("SELECT * FROM 'table'LIMIT 600000", Mydatabase)

It can read at most about 0.6 million records, no more, on a PC with 16 GB of RAM (the operation takes about 15.8 seconds). It may be a bit mischievous to add that loading directly from the csv file is slightly more efficient:

giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)

(the operation takes about 16.5 seconds)

SECOND POINT: To use SQL effectively on data converted from CSV, we ought to remember to store dates in a suitable form. So I propose adding this to ryguy72's code:

df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])

The full code for the 311 file I mentioned above:

import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
### sqlalchemy create_engine
plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')
#----------------------------------------------------------------------
chunksize = 100000
i = 0
j = 1
## --------------------------------------------------------------------
for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
## -----------------------------------------------------------------------
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
## --------------------------------------------------------------------------
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1
print(time.time() - start_time)

At the end I would like to add: converting a csv file directly from the Internet to a database seems to me a bad idea. I suggest downloading the file first and converting it locally.
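
For example, a minimal sketch of the download step with the standard library (the URL and local path below are placeholders), after which the chunked conversion above runs on the local copy:

import urllib.request

url = 'https://example.com/your_export.csv'  # placeholder URL
local_path = 'c:/1/downloaded.csv'           # placeholder local path
urllib.request.urlretrieve(url, local_path)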

Wojciech Moszczyński