
I am using the following code for reading a CSV file into a dictionary.

import csv

file_name = path + '/' + file.filename   # path and file are defined elsewhere
with open(file_name, newline='') as csv_file:
    csv_dict = [{k: v for k, v in row.items()}
                for row in csv.DictReader(csv_file)]
    for item in csv_dict:
        call_api(item)

This reads the file and calls the function for each row. As the number of rows increases, the number of API calls increases with it. It is also not possible to load all the contents into memory, split them, and call the API from there, because the data is too big. So I would like an approach where the file is read using a limit and offset, as in SQL queries. How can this be done in Python? I am not seeing any option to specify the number of rows to read or the number of rows to skip in the csv module documentation. If someone can suggest a better approach, that will also be fine.

  • Your indentation is wrong. What does your file look like? Is it just one file? How many "columns" per line? How many lines are in it? DYZ "fixed" your indentation - is that how it looks on your side? – Patrick Artner Feb 11 '20 at 07:02
  • You iterate over the whole parsed list, so why parse all lines up front with a list comprehension? Just go line by line and call your API line by line. – Patrick Artner Feb 11 '20 at 07:03
  • Can you please show me an example for doing so ? – Happy Coder Feb 11 '20 at 07:04
  • I have one file. For testing it has five records and three columns. I am reading it to a dictionary using csv.DictReader. This helps me to get the column headings attached to each item. Then I am calling the API. – Happy Coder Feb 11 '20 at 07:06
  • Your `csv_dict` is really a **`list`** and could be created more succinctly with `csv_dict = list(csv.DictReader(csv_file))` since rows from a `DictReader` are already dictionaries. – martineau Feb 11 '20 at 07:15
  • If your matter is solved, please mark the answer as accepted so that others can see that your question has been answered. – FredrikHedman Mar 03 '20 at 12:12

2 Answers


You can call your API directly with just one row in memory at a time:

with open(file_name, newline='') as csv_file:
    for row in csv.DictReader(csv_file): 
        call_api(row)        # call api with row-dictionary, don't persist all to memory 

You can skip rows by calling next() on the reader before the for loop (create the DictReader first, so the header row is still used for the keys):

with open(file_name, newline='') as csv_file:
    reader = csv.DictReader(csv_file)   # header row is consumed here
    for _ in range(10):                 # skip the first 10 data rows
        next(reader)
    for row in reader:
        call_api(row)

You can skip rows in between using continue:

with open(file_name, newline='') as csv_file:
    for i, row in enumerate(csv.DictReader(csv_file)):
        if i % 2 == 0:
            continue          # skip every other row
        call_api(row)

You can simply count processed rows and break after n rows are done:

n = 0
with open(file_name, newline='') as csv_file:
    for row in csv.DictReader(csv_file):
        if n == 50:           # stop after 50 rows
            break
        call_api(row)
        n += 1

You can combine those approaches to, say, skip 100 rows and take 200, keeping only every 2nd one: that mimics limit and offset, with a modulo check on the row number for the "every nth" part.
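
A compact way to express that combination (a sketch, assuming the same `file_name` and `call_api` as above) is `itertools.islice`, which takes a start, stop, and step:

    import csv
    from itertools import islice

    OFFSET = 100   # rows to skip, like SQL OFFSET
    LIMIT = 200    # size of the window to take, like SQL LIMIT

    with open(file_name, newline='') as csv_file:
        reader = csv.DictReader(csv_file)                      # header row is consumed here
        for row in islice(reader, OFFSET, OFFSET + LIMIT, 2):  # every 2nd row of the window
            call_api(row)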

Or you can use something that's great with CSV handling, like pandas.

Patrick Artner

A solution could be to use pandas to read the CSV:

import pandas as pd

file_name = 'data.csv'
OFFSET = 10
LIMIT = 24
CHSIZE = 6
header = list('ABC')
reader = pd.read_csv(file_name, sep=',',
                     header=None, names=header,          # Header 'A', 'B', 'C'
                     usecols=[0, 1, 4],                  # Select some columns
                     skiprows=lambda idx: idx < OFFSET,  # Skip lines
                     chunksize=CHSIZE,                   # Chunk reading
                     nrows=LIMIT)

for df_chunk in reader:
    # Each df_chunk is a DataFrame, so
    # an adapted api may be needed to
    # call_api(item)
    for row in df_chunk.itertuples():
        print(row._asdict())
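
If `call_api` expects one dictionary per row, as in the question, each namedtuple can be converted before the call, e.g. `call_api(row._asdict())` in place of the `print`.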
FredrikHedman
  • Thanks. This works fine. The only thing is that I don't need the index column; even if I try with `index_col=False` it is still picking up the index. Also, do we need to give both `limit` and `chunksize`? – Happy Coder Feb 12 '20 at 05:55
  • Could be that `usecols` and `index_col` overlap. You do not need to use `nrows`. The option `chunksize` is needed to get the effect of reading the file in chunks. – FredrikHedman Feb 12 '20 at 06:42
  • Okay. I mean when I read the data it shows `'Index': 1` etc. with each row. I don't need this and only need the data from the CSV. – Happy Coder Feb 12 '20 at 07:08
  • That index is part of the `DataFrame` data structure. More specifically, it is a `RangeIndex` and is the default index type used by `DataFrame`. For more details I highly recommend https://pandas.pydata.org/pandas-docs/stable/getting_started/ – FredrikHedman Feb 12 '20 at 16:47
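
To drop that index field from the rows themselves, `itertuples(index=False)` omits it, so `_asdict()` yields only the CSV columns. A minimal sketch, assuming the `reader` and `call_api` from above:

    for df_chunk in reader:
        for row in df_chunk.itertuples(index=False):  # no Index field in the namedtuple
            call_api(row._asdict())                   # plain dict of the CSV columns only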