10

This is a question that often comes up in interviews.

I know how to read CSV files using Pandas.

However, I am struggling to find a way to read files without using external libraries.

Does Python come with any module that would help read CSV files?

Gonçalo Peres
  • A dataframe can be seen as a collection of records or as a list of columns. Numpy (and pandas) are mainly C or Cython optimizations to speed up the processing of large data frames, but you can implement everything *by hand*. Only posting a comment because the current question is rather broad. – Serge Ballesta Mar 28 '19 at 18:08

5 Answers

15

You will most likely want a library to read a CSV file. While you could open and parse the data yourself, this would be tedious and time-consuming. Luckily, Python comes with a standard `csv` module that you won't have to pip install! You can read your file in like this:

import csv

with open('file.csv', 'r', newline='') as file:  # newline='' is recommended by the csv docs
    my_reader = csv.reader(file, delimiter=',')
    for row in my_reader:
        print(row)  # each row comes back as a list of strings

This will show you that each row is being read in as a list, which you can then process by index! There are other ways to read in the data too, as described at https://docs.python.org/3/library/csv.html, one of which (`csv.DictReader`) will give you a dictionary for each row instead of a list!
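
For example, here's a minimal sketch of the dictionary-based approach (assuming your file has a header row, like the one shown in the update below):

import csv

# csv.DictReader yields each row as a dict keyed by the header names
with open('file.csv', 'r', newline='') as file:
    for row in csv.DictReader(file):
        print(row['product_name'], row['aisle_id'])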

Update

You linked your GitHub repo for the project, so I took this snippet from it:

product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3

I saved it as file.csv and ran it with the code I posted above. Result:

['product_id', 'product_name', 'aisle_id', 'department_id']
['9327', 'Garlic Powder', '104', '13']
['17461', 'Air Chilled Organic Boneless Skinless Chicken Breasts', '35', '12']
['17668', 'Unsweetened Chocolate Almond Breeze Almond Milk', '91', '16']
['28985', 'Michigan Organic Kale', '83', '4']
['32665', 'Organic Ezekiel 49 Bread Cinnamon Raisin', '112', '3']
['33120', 'Organic Egg Whites', '86', '16']
['45918', 'Coconut Butter', '19', '13']
['46667', 'Organic Ginger Root', '83', '4']
['46842', 'Plain Pre-Sliced Bagels', '93', '3']

This does what you asked in your question. I am not going to do your project for you; you should be able to work it out from here.

Reedinationer
  • What if I am supposed to use only Input and Output Libraries? Can I still use `import csv`? – Mosali HarshaVardhan Reddy Mar 28 '19 at 18:19
  • @MosaliHarshaVardhanReddy What do you mean by "Input and Output Libraries"? `csv` comes with a `csv.reader()` and `csv.writer()` method. Does this make it qualify as an "Input and Output Library"? – Reedinationer Mar 28 '19 at 18:21
  • Instead of using the CSV reader, I may have to use file.reader("file.csv") and convert it into a DataFrame – Mosali HarshaVardhan Reddy Mar 28 '19 at 18:27
  • 1
    I am confused. You want a DataFrame, but you refuse to use `numpy`. I don't think you can have it both ways...DataFrames are `numpy` specific as far as I'm aware. – Reedinationer Mar 28 '19 at 18:29
  • @MosaliHarshaVardhanReddy So you are saying it is a requirement for you to parse the data yourself? And you are not allowed to use even standard Python library modules? I guess to make a dataframe the best you could do is make a list of lists – Reedinationer Mar 28 '19 at 18:35
  • Yeah, I need to make a list of lists and then map them with the corresponding values. I am able to make some progress. I will post a GitHub link after I do my analysis in the comment. Thanks for the help. – Mosali HarshaVardhan Reddy Mar 28 '19 at 18:50
  • 1
    @MosaliHarshaVardhanReddy I would truly urge you to use the `csv` module unless specified otherwise (which in your post you say only `numpy` and `pandas` are excluded). Then you can either make an sql database using `sqlite3` or make a list of lists or a list of dictionaries to represent your data for analysis. I see no reason you should not be able to import anything at all. If that is the case though you're in for a helluva hard project that will be tedious and time consuming and neglect the best part of python: *not having to reinvent the wheel with each program* – Reedinationer Mar 28 '19 at 19:48
  • @MosaliHarshaVardhanReddy Nice job. Good luck on your interview then! – Reedinationer Apr 03 '19 at 22:48
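
As discussed in the comments above, if you could not import anything at all, a bare-bones sketch of parsing by hand might look like the following; it assumes a simple comma-delimited file with no quoted fields containing commas.

# Hand-rolled parsing into a list of lists -- fine for this file,
# but it will break on quoted fields that contain commas
with open('file.csv', 'r') as file:
    lines = [line.rstrip('\n') for line in file]

header = lines[0].split(',')                     # column names from the first row
rows = [line.split(',') for line in lines[1:]]   # list of lists, like csv.reader
print(header)
print(rows[0])
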
2

Recently I got a very similar, but more complicated, question about building a data structure without using pandas. This is the only relevant question I have found so far. Applied to this question, what I was asked was: use the product ids as keys of a dictionary and lists of tuples of aisle and department ids as values (in Python). That dictionary is the required dataframe. Of course I could not do it in 15 minutes (it took me closer to 2 hours); it is hard for me to think outside of numpy and pandas.

I have the following solution, which also answers the question at the top. It is probably not ideal, but it got me what I needed. Hopefully it helps too.

import csv

with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    items = list(reader)  # put the rows of the csv into a list

aisle_dept_id = []  # to hold tuples of (aisle_id, department_id)
mydict = {}  # product id as keys and a list of the above tuples as values

product_id, product_name, aisle_id, department_id = [], [], [], []

for i in range(1, len(items)):  # skip the header row
    product_id.append(items[i][0])
    product_name.append(items[i][1])
    aisle_id.append(items[i][2])
    department_id.append(items[i][3])

for item1, item2 in zip(aisle_id, department_id):
    aisle_dept_id.append((item1, item2))
for item1, item2 in zip(product_id, aisle_dept_id):
    mydict.update({item1: [item2]})

With the output,

mydict:
{'9327': [('104', '13')],
 '17461': [('35', '12')],
 '17668': [('91', '16')],
 '28985': [('83', '4')],
 '32665': [('112', '3')],
 '33120': [('86', '16')],
 '45918': [('19', '13')],
 '46667': [('83', '4')],
 '46842': [('93', '3')]}
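
For comparison, a more compact sketch that builds the same dictionary in a single pass while reading (assuming the column order product_id, product_name, aisle_id, department_id from the question's file):

import csv

mydict = {}
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    for product_id, product_name, aisle_id, department_id in reader:
        mydict[product_id] = [(aisle_id, department_id)]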
Manjit P.
2

When one's production environment is limited by memory, being able to read and manage data without importing additional libraries may be helpful.

In order to achieve that, the built-in csv module does the work.

import csv

There are at least two ways to do that: using csv.reader() or using csv.DictReader().

csv.reader() lets you access the CSV data by index and is ideal for simple CSV files.

csv.DictReader(), on the other hand, is friendlier and easier to use, especially when working with large CSV files.

Here's how to do it with csv.reader()

>>> import csv
>>> with open('eggs.csv', newline='') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print(', '.join(row))
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

Here's how to do it with csv.DictReader()

>>> import csv
>>> with open('names.csv', newline='') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...         print(row['first_name'], row['last_name'])
...
Eric Idle
John Cleese

>>> print(row)
{'first_name': 'John', 'last_name': 'Cleese'}

For another example, check Real Python's page on working with CSV files.

Gonçalo Peres
1

Had a similar requirement and came up with this solution: a function that converts CSV to JSON (I needed JSON for readability and to make querying the data easier without having access to Pandas). If the headers argument of the function is True, the first row of the CSV is used for the keys in the JSON; otherwise the value indices are used as keys.

from csv import reader as csv_reader

def csv_to_json(csv_path: str, headers: bool) -> list:
    '''Convert data from a csv to json'''
    # store json data
    json_data = []

    try:
        with open(csv_path, 'r') as file:
            reader = csv_reader(file)
            # set column names using the first row
            if headers:
                columns = next(reader)

            # convert csv rows to json records
            for row in reader:
                row_data = {}
                for i, value in enumerate(row):
                    # key names come from the header, or fall back to the column index
                    if headers:
                        row_key = columns[i].lower()
                    else:
                        row_key = str(i)  # json object keys are strings
                    # set key/value
                    row_data[row_key] = value
                # add the record to the json store
                json_data.append(row_data)

    # error handling
    except Exception as e:
        print(repr(e))

    return json_data

Given a csv containing the following

+------+-------+------+
| Year | Month | Week |
+------+-------+------+
| 2020 |    11 |   11 |
| 2020 |    12 |   12 |
+------+-------+------+

The output with headers is

[
  {"year": "2020", "month": "11", "week": "11"},
  {"year": "2020", "month": "12", "week": "12"}
]

The output without headers is

[
  {"0": "2020", "1": "11", "2": "11"},
  {"0": "2020", "1": "12", "2": "12"}
]
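
A small usage sketch (assuming the table above is saved as data.csv; since csv hands everything back as strings, convert the values yourself if you need numbers):

import json

rows = csv_to_json('data.csv', headers=True)  # 'data.csv' is an assumed file name
print(json.dumps(rows, indent=2))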
fjemi
0

The following solutions are inspired by this answer. The output content in the examples below is generated using the following input data:

data.csv

Id,name,age,height,weight
1,Alice,20,62,120.6
2,Freddie,21,74,190.6
3,Bob,17,68,120.0

In case you would like to pretty print the output in the examples given below, you could use the following:

import json
print(json.dumps(data, indent=4, sort_keys=True, default=str))

Solution 1 - Use csv.reader() to get a list of list objects

import csv


def read_csv(filepath: str):
    data = []
    with open(filepath, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:             
            data.append(row)  
        
    return data
        
        
data = read_csv('data.csv')
print(data)

Output

[['Id', 'name', 'age', 'height', 'weight'], ['1', 'Alice', '20', '62', '120.6'],
 ['2', 'Freddie', '21', '74', '190.6'], ['3', 'Bob', '17', '68', '120.0']]

To print the data line by line, you could also use the following:

print('\n'.join(', '.join(map(str,row)) for row in data))

Output:

Id, name, age, height, weight
1, Alice, 20, 62, 120.6
2, Freddie, 21, 74, 190.6
3, Bob, 17, 68, 120.0

Solution 2 - Use csv.DictReader() to get a list of dict objects

import codecs
import csv


def read_csv(filepath):
    with open(filepath, 'rb') as f:
        reader = csv.DictReader(codecs.iterdecode(f, 'utf-8'))
        data = list(reader)
        
    return data
        
        
data = read_csv('data.csv')
print(data)

Output

[{'Id': '1', 'name': 'Alice', 'age': '20', 'height': '62', 'weight': '120.6'}, 
 {'Id': '2', 'name': 'Freddie', 'age': '21', 'height': '74', 'weight': '190.6'}, 
 {'Id': '3', 'name': 'Bob', 'age': '17', 'height': '68', 'weight': '120.0'}]
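
With a list of dict objects, column-wise calculations become straightforward; for example, a small sketch using the data above:

# values come back as strings, so convert before doing arithmetic
total_weight = sum(float(row['weight']) for row in data)
print(round(total_weight, 1))  # 431.2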

Solution 3 - Use csv.DictReader() to get a dictionary of dict objects based on a primary key

import codecs
import csv


def read_csv(filepath):
    data = {}
    with open(filepath, 'rb') as f:
        reader = csv.DictReader(codecs.iterdecode(f, 'utf-8'))
        for row in reader:             
            key = row['Id']  # Assuming a column named 'Id' to be the primary key
            data[key] = row  
        
    return data
        
        
data = read_csv('data.csv')
print(data)

Output

{'1': {'Id': '1', 'name': 'Alice', 'age': '20', 'height': '62', 'weight': '120.6'}, 
 '2': {'Id': '2', 'name': 'Freddie', 'age': '21', 'height': '74', 'weight': '190.6'}, 
 '3': {'Id': '3', 'name': 'Bob', 'age': '17', 'height': '68', 'weight': '120.0'}}

Pretty printed output (using the code mentioned at the top of this answer):

{
    "1": {
        "Id": "1",
        "age": "20",
        "height": "62",
        "name": "Alice",
        "weight": "120.6"
    },
    "2": {
        "Id": "2",
        "age": "21",
        "height": "74",
        "name": "Freddie",
        "weight": "190.6"
    },
    "3": {
        "Id": "3",
        "age": "17",
        "height": "68",
        "name": "Bob",
        "weight": "120.0"
    }
}
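
With this structure, a record can be looked up directly by its Id; for example:

# direct lookup by the primary key (keys and values are strings)
print(data['2']['name'])    # Freddie
print(data['3']['weight'])  # 120.0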
Chris