
I want to write a Python script that automatically reads files with the following extensions (csv, tsv, json, xml, xls, xlsx, hdf5, sql) and shows the first 100 lines. The only parameter I will give the script is the path.

This is my first try. I could use switch cases instead of if/elif for good practice, but other than that, do you have any suggestions to improve the code?

import os
import pandas as pd

file_path = input("enter the path to the file you want to open")
for file in file_path :
    if file.endswith('.csv', 'tsv') :
        df = pd.read_csv(file)
        print (df.head(100))
    elif file.endswith('.json') :
        df = pd.read_json(file)
        print (df.head(100))
    elif file.endswith('.xml') :
        df = pd.read_xml(file)
        print (df.head(100))
    elif file.endswith('.xls','xlsx') : 
        df = pd.read_excel(file)
        print (df.head(100))
    elif file.endswith('.hdf') :
        df = pd.read_hdf(file)
        print (df.head(100))
    elif file.endswith('.sql') :
        df = pd.read_sql(file)
        print (df.head(100))
    else:
        print("file format not supported")
Hamza
    If you intend to open ```.hdf and .sql``` with ```read_excel()``` then why not put them in one place with ```file.endswith('.xls','xlsx')``` – Abhi Jul 30 '21 at 17:25
    This is a bit hacky but you could make a dictionary that maps strings to functions, e.g. ```translateDict = {".csv":pd.read_csv, ".xml":pd.read_xml}``` That way you can write something like ```df = translateDict[fileExtension](file)``` – tcglezen Jul 30 '21 at 17:27
  • @Abhi sorry I just miscopied the code, fixed it for Hdf and SQL now – Hamza Jul 30 '21 at 18:23
    `.sql` files don't work like that. There is no particular standard for the extension `.sql`, but the pandas `read_sql` function needs to be given an SQL query and an SQL connection (like from Sqlite or MySql). – Tim Roberts Jul 30 '21 at 19:59
    This if-elif-else chain looks fine. Just put it into one function that returns the dataframe, and you're good to go. While the dictionary hack is nice, there'll be exceptions where this fails and you need specific code (see Tim Roberts comment, although SQL is not even a file format), possibly making it untenable in the end. – 9769953 Jul 30 '21 at 20:02
    Just don't put the `print` function on every line. Put it at the bottom, quickly checking that `df` is not `None` (set `df` to `None` at the top or in the `else` branch) and printing it. That removes the `print()` or any other non-reading functionality out of the if-elif-else chain. – 9769953 Jul 30 '21 at 20:04
  • related: [Replacements for switch statement in Python?](https://stackoverflow.com/q/60208/10197418) – FObersteiner Aug 18 '21 at 09:17
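To make Tim Roberts' point concrete: `pd.read_sql` takes a query and an open connection, not a file path. A minimal sketch using an in-memory SQLite database (the table name and data here are made up for illustration):

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, age INTEGER)")
con.execute("INSERT INTO users VALUES ('a', 1), ('b', 2)")

# read_sql wants a query *and* a connection -- there is no path to pass:
df = pd.read_sql("SELECT * FROM users", con)
print(df.head(100))
con.close()
```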

3 Answers


Just to condense the fruitful comment stack into a self-contained answer.

if-elif slightly condensed and converted to function

import pandas as pd

def read_any(file):
    if file.endswith(('.csv', '.tsv')):  # note: .tsv may additionally need sep='\t'
        df = pd.read_csv(file)
    elif file.endswith('.json'):
        df = pd.read_json(file)
    elif file.endswith('.xml'):
        df = pd.read_xml(file)
    elif file.endswith(('.xls', '.xlsx')):
        df = pd.read_excel(file)
    elif file.endswith('.hdf'):
        df = pd.read_hdf(file)
    elif file.endswith('.sql'):
        df = pd.read_sql(file)  # needs a query and a connection; see below
    else:
        raise ValueError(f'Unsupported filetype: {file}')
    return df

if __name__ == '__main__':
    # here or wherever it is used
    file_path = input("enter the path to the file you want to open")
    df = read_any(file_path)
    print(df.head(100))

Hacky (?) dictionary for functions

Now to the possible refactoring of the if-elif stack into a dictionary lookup. I don't find it hacky, but it's not really necessary. It's more of a stylistic choice:

import os.path

import pandas as pd

def my_read_sql(con_w_sql, sep='#'):
    """Made-up function that reads SQL from CON in a single call

    Args:
        con_w_sql: connection string + query or table name,
            joined by separator string sep
        sep: separator

    Example:
        >>> my_read_sql('postgres:///db_name#table_name')
    """
    con, sql = con_w_sql.split(sep)
    return pd.read_sql(sql, con)

READER_MAP = {
    '.xlsx': pd.read_excel,  # os.path.splitext keeps the leading dot
    '.xls': pd.read_excel,
    '.xml': pd.read_xml,  # .. and so on
    '.sql': my_read_sql,  # special case
}

And with all that, read_any would shorten to:

def read_any(file):
    _, ext = os.path.splitext(file)
    try:
        reader = READER_MAP[ext]
    except KeyError:
        raise ValueError(f'Unsupported filetype: {ext}')
    return reader(file)

I don't really like that one has to make up a non-standard convention to concatenate sql (or table names) with con (connection string). I would keep separate user-facing functions for file-like and DB-like reading.

Conclusion

Ok, now having written that out, I would recommend sticking to the if-elif solution. It handles special cases with much less fuss, and can be refactored to call my_read_sql-type special handlers with the same line count as READER_MAP, but without that variable.

Also: the .endswith stack is more flexible with respect to double extensions (e.g. .csv.gz), which would be trickier to get right with the READER_MAP approach.
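To illustrate why: `os.path.splitext` only peels off the last extension, so a dictionary keyed on it would dispatch `.csv.gz` to a (hypothetical) `.gz` reader, while `.endswith` can match the whole compound suffix:

```python
import os.path

# splitext returns only the final extension:
print(os.path.splitext("data.csv.gz"))  # ('data.csv', '.gz')

# endswith can test the full compound suffix directly:
print("data.csv.gz".endswith((".csv", ".csv.gz")))  # True
```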

ojdo

To make it clean you can use a ternary expression. Although it is still if/else, it reads more cleanly. Note that the code you provided and the answer both have a mistake: in both, you have to pass a tuple to give two or more values to the endswith method, i.e.

file.endswith(('.csv', '.tsv'))
#this is correct
file.endswith('.csv', '.tsv')
#this is wrong
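The "wrong" form is not just stylistically wrong, it fails at runtime: `str.endswith(suffix[, start[, end]])` treats extra positional arguments as slice indices, so a second string raises a TypeError instead of being checked as a suffix:

```python
# The second positional argument must be an int (slice start),
# so passing another suffix string raises rather than matching it:
try:
    "report.csv".endswith(".csv", ".tsv")
except TypeError as exc:
    print(exc)

# A tuple checks every suffix in it:
print("report.tsv".endswith((".csv", ".tsv")))  # True
```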
import os
import pandas as pd


file_path = input("enter the path to the directory containing the files")
for file in os.listdir(file_path):
    path = os.path.join(file_path, file)
    df = pd.read_csv(path) if file.endswith(('.csv', '.tsv')) else None
    df = pd.read_json(path) if file.endswith('.json') else df
    df = pd.read_xml(path) if file.endswith('.xml') else df
    df = pd.read_excel(path) if file.endswith(('.xls', '.xlsx')) else df
    df = pd.read_hdf(path) if file.endswith('.hdf') else df
    df = pd.read_sql(path) if file.endswith('.sql') else df
    if file.endswith(('.csv', '.tsv', '.json', '.xml', '.xls', '.xlsx', '.hdf', '.sql')):
        print(df.head(100))
    else:
        print("format not supported")
farhan jatt
    tested it, works great, thanks for the correction on the tuples, I have a question if you could help with, if instead of giving my code the Dir Path and let it walk through all files and apply the whole extension verification, is there a way to have a targeted path directly to the desired file ? – Hamza Jul 31 '21 at 17:13
    You just have to use this instead of giving directory df = pd.read_csv("C:\\Users\\user\\Desktop\\test.csv") – farhan jatt Jul 31 '21 at 17:23
    You might want to test `if df:` for the last test instead of repeating the multiple tests on extensions. – mozway Aug 01 '21 at 01:36
    I think try/except would be better – farhan jatt Aug 01 '21 at 01:50
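One caveat about the `if df:` suggestion above: pandas deliberately refuses to truth-test a DataFrame, so that literal spelling raises; the check has to be `df is not None`. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# A DataFrame's truth value is ambiguous, so `if df:` raises:
try:
    if df:
        pass
except ValueError as exc:
    print(exc)

# Compare against the None sentinel explicitly instead:
print(df is not None)  # True
```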

Thank you all for your suggestions. The following code, composed from your different suggestions plus some stuff I added, now works and can read datasets in different formats (still have problems with sql files). It returns the first 100 lines and the types (using Featuretools instead of pandas) in JSON format. Again, thank you all for your ideas and help.

import os
import modin.pandas as pd
import json
import featuretools as ft
import matplotlib.pyplot as plt
import numpy as np
import time
from distributed import Client


def sqlread(con_w_sql, sep='#'):
    con, sql = con_w_sql.split(sep)
    df = pd.read_sql(sql, con)
    return df.head()

readict = {
    ".csv": {"read": pd.read_csv},
    ".tsv": {"read": pd.read_csv},
    ".json": {"read": pd.read_json},
    ".xlsx": {"read": pd.read_excel},
    ".xml": {"read": pd.read_xml},
    ".xls": {"read": pd.read_excel},
    ".hdf": {"read": pd.read_hdf},
    ".sql": {"read": sqlread},
}


def read_any(file):
    _, ext = os.path.splitext(file)
    df = readict[ext]["read"](file)     
    es = ft.EntitySet()
    es = es.entity_from_dataframe(entity_id="df",
                               dataframe = df)
    head = df.head(100).values.tolist()
    dtype = es["df"]

    dtypedict = dtype.variable_types
    types = []

    for value in dtypedict.values():
        types.append(value.type_string)

    return json.dumps(head), json.dumps(types)

file = input("enter the path to the file you want to open : ")
head_json, types_json = read_any(file)
print(head_json)
print(types_json)
Hamza