7

I have a very large CSV file with millions of rows, and a list of the row numbers that I need, like:

rownumberList = [1,2,5,6,8,9,20,22]

I know there is a parameter called skiprows that skips rows when reading a CSV file, like this:

df = pd.read_csv('myfile.csv', skiprows=skiplist)
# skiplist would be the full list of row numbers minus rownumberList

However, since the CSV file is very large, directly selecting only the rows that I need could be more efficient. So I was wondering: are there any methods to select rows within read_csv itself, rather than selecting rows from the DataFrame afterwards? I am trying to minimize the time spent reading the file. Thanks.
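For illustration, the skiprows approach I described above would look roughly like this (just a sketch, assuming I already know the total number of rows, here called total_rows):

import pandas as pd

rownumberList = [1, 2, 5, 6, 8, 9, 20, 22]
total_rows = 1000000  # assumed to be known in advance

# skip every row number except the ones I actually need
skiplist = [i for i in range(total_rows) if i not in rownumberList]
df = pd.read_csv('myfile.csv', skiprows=skiplist)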

huier

8 Answers

15

There is a parameter called nrows (int, default None): the number of rows of the file to read. It is useful for reading pieces of large files (see the docs).

pd.read_csv(file_name, nrows=n)  # n = number of rows to read from the start of the file

In case you need some part from the middle, use both skiprows and nrows in read_csv: skiprows skips the leading rows, and nrows limits how many rows are read after the skip.

Example:

pd.read_csv('../input/sample_submission.csv', skiprows=5, nrows=10)

This will read the data from the 6th row to the 16th row.
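If the file has a header line you want to keep, one variant (a sketch, assuming the header is the first line of the file) is to pass a range that starts at 1, so only data rows are skipped:

import pandas as pd

# keep line 0 as the header, skip data rows 1-5, then read the next 10 rows
df = pd.read_csv('../input/sample_submission.csv', skiprows=range(1, 6), nrows=10)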

Edit based on comment:

Since you have a list of the required row numbers, this might help, i.e.:

li = [1, 2, 3, 5, 9]
r = [i for i in range(max(li)) if i not in li]
df = pd.read_csv('../input/sample_submission.csv', skiprows=r, nrows=max(li))
# This will skip the rows you don't want, and also limit the number of rows read to the maximum of the list.
Bharath M Shetty
  • Thanks, Dark. But the problem is that I have a list containing the row numbers I need, like list = [1,3,50,60], rather than choosing 60 rows from the beginning. – huier Dec 21 '17 at 04:53
  • It is a lot of work for the function to go through the CSV while keeping a counter of which line number it is on. I would prefer you load the CSV and then select the rows in the DataFrame, as in the sketch below. – Bharath M Shetty Dec 21 '17 at 04:56
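A minimal sketch of that load-then-select approach from the comment above (assuming rownumberList holds positional indices of the data rows):

import pandas as pd

rownumberList = [1, 2, 5, 6, 8, 9, 20, 22]

# read the whole file once, then pick the wanted rows by position
df = pd.read_csv('myfile.csv')
selected = df.iloc[rownumberList]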
3
import pandas as pd

rownumberList = [1,2,5,6,8,9,20,22]
df = pd.read_csv('myfile.csv', skiprows=lambda x: x not in rownumberList)

As of pandas 0.25.1, you can pass a callable to skiprows in read_csv: it is called with each row index, and rows for which it returns True are skipped.
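If the first line is a header you want to keep, a small variation (a sketch, assuming 0-based row indices where index 0 is the header line) is:

import pandas as pd

rownumberList = [1, 2, 5, 6, 8, 9, 20, 22]

# keep the header row (index 0) plus every row whose index is in the list
df = pd.read_csv('myfile.csv',
                 skiprows=lambda x: x != 0 and x not in rownumberList)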

gaozhidf
1

I am not sure about read_csv() from pandas (though there is a way to use an iterator for reading a large file in chunks), but you can read the file line by line (lazily, without loading the whole file into memory) with csv.reader (or csv.DictReader), keeping only the desired rows with the help of enumerate():

import csv
import pandas as pd

DESIRED_ROWS = {1, 17, 28}
with open("input.csv") as input_file:
    reader = csv.reader(input_file)

    desired_rows = [row for row_number, row in enumerate(reader)
                    if row_number in DESIRED_ROWS]

df = pd.DataFrame(desired_rows)

(assuming you would like to pick random/discontinuous rows and not a "continuous chunk" from somewhere in the middle - in that case @James's idea of having a "start" and "stop" would generally work better).

alecxe
  • That still reads through the entire file, as opposed to stopping when the last line of interest is read. – James Dec 21 '17 at 04:42
  • @James well, our solutions both have the same worst case (stop being the last line). But that was a good idea to stop iterating once we know we don't need to go any further (see the sketch below). Thanks. – alecxe Dec 21 '17 at 04:44
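A minimal sketch of that early-stopping variation discussed above (assuming the wanted row numbers are known up front):

import csv
import pandas as pd

DESIRED_ROWS = {1, 17, 28}
last_wanted = max(DESIRED_ROWS)

desired_rows = []
with open("input.csv") as input_file:
    for row_number, row in enumerate(csv.reader(input_file)):
        if row_number > last_wanted:
            break  # nothing left to collect, so stop reading the file
        if row_number in DESIRED_ROWS:
            desired_rows.append(row)

df = pd.DataFrame(desired_rows)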
1
import pandas as pd

df = pd.read_csv('Data.csv')

df.iloc[3:6] 

Returns rows 3 through 5 and all columns.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html

J. Weikert
  • Thanks, but like I said in the description, I am trying to select rows within read_csv(), not afterwards. – huier Dec 21 '17 at 04:51
1

From the documentation you can see that skiprows can take an integer or a list of integers to skip some lines.

So basically you can tell it to skip all rows except the ones you want. For this you first need to know the number of lines in the file (best if you know it beforehand), by opening it and counting as follows:

with open('myfile.csv') as f:
    row_count = sum(1 for row in f)

Now you need to create the complementary collection (sets are used here instead of a list, which also works). First create the set of all row numbers from 1 to the number of rows, then subtract the row numbers you want to read.

skiplist = set(range(1, row_count+1)) - set(rownumberList)

Finally you can read the CSV as normal.

df = pd.read_csv('myfile.csv', skiprows=skiplist)

Here is the full code:

import pandas as pd

with open('myfile.csv') as f:
    row_count = sum(1 for row in f)

rownumberList = [1,2,5,6,8,9,20,22]
skiplist = set(range(1, row_count+1)) - set(rownumberList)

df = pd.read_csv('myfile.csv', skiprows=skiplist)
Traxidus Wolf
  • I was wrong at first; I misunderstood the question, but now it works as wanted. The process of reading the lines takes only a few seconds (tested with a 2,407,609-line file). If your file's number of lines won't change, you can just put that number there instead of row_count. – Traxidus Wolf Dec 21 '17 at 05:59
1

You could try this:

import pandas as pd
# making a data frame from a csv file
data = pd.read_csv("your_csv_file.csv", index_col="What_you_want")
# retrieving multiple rows by the iloc method
rows = data.iloc[[1, 2, 5, 6, 8, 9, 20, 22]]
daniel zoulla
0

You will not be able to circumvent the read time when accessing a large file. If you have a very large CSV file, any program will need to read through it at least up to the point where you want to begin extracting rows. Really, that is what databases are designed for.

However, if you want to extract, say, rows 300,000 to 300,123 from a 10,000,000-row CSV file, you are better off reading just the data you need into Python before converting it to a DataFrame in pandas. For this you can use the csv module.

import csv
import pandas as pd

start = 300000
stop = start + 123
data = []
with open('/very/large.csv', 'r') as fp:
    reader = csv.reader(fp)
    for i, line in enumerate(reader):
        if i > stop:
            break  # stop reading once we are past the block we need
        if i >= start:
            data.append(line)

df = pd.DataFrame(data)
James
-2

for i in range(1, 20):

the first parameter is the first row, and the second parameter is the stop value, which is excluded, so this covers rows 1 through 19...
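A sketch of how such a range could be plugged into read_csv (assuming the intent is to skip data rows 1 through 19 while keeping the header line):

import pandas as pd

# range(1, 20) covers rows 1 through 19; the stop value 20 is excluded
df = pd.read_csv('myfile.csv', skiprows=range(1, 20))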