Description

My problem is about loading data from CSV files. I already wrote code that loads a fixed number of columns into arrays (see the example). Now I would like to improve it so that I can change the number of columns to read and load without modifying the code every time. In other words, I would like the code to adapt dynamically to the number of columns I choose. Here is an example of my present code.

Code example

Steps:

1. With Tkinter I select the files I want to load; this part of the code returns file_path, which contains the paths of the selected files.

2. Then I define the parameters for CSV reading, create the arrays that will hold my data, and load the data.

from numpy import zeros, loadtxt

# define CSV parameters first: size_data and loadtxt both need them
row_skip = 5
delim = ';'
col_to_read = (0,1)    # <= This is where I choose the columns to be read

n = len(file_path)    # number of files

# here I just determine the size of each file with a custom function, m is the maximum size
all_size , m = size_data(file_path,row_skip,col_to_read,delim)

# I create the arrays
shape = (n, m)
time = zeros(shape)
CH1 = zeros(shape)

# I load the arrays
for k in range(0, len(file_path)):
    end = all_size[k]    # this is the size of the array to be loaded.
                         # I do this in order to avoid the annoying error
                         # ValueError: could not broadcast input array from shape (20) into shape (50)

    time[k][:end], CH1[k][:end] = loadtxt(file_path[k],
                                           delimiter=delim,
                                           skiprows=row_skip,
                                           usecols=col_to_read,
                                           unpack=True)

My problem is that if each file has 3 columns, i.e. col_to_read = (0,1,2), I have to add a new array CH2 = zeros(shape) both where the arrays are created and where they are loaded. I would like a solution that adapts dynamically to the number of columns I want to load, so that only col_to_read has to be changed by hand. Ideally I would like to put this code inside a function, because I do a lot of data analysis and I don't want the same code pasted into every program.
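One possible shape for such a function, as a minimal sketch rather than a definitive answer: instead of one named array per column (time, CH1, CH2, ...), it fills a single zero-padded array with one axis for the columns, so the number of columns follows col_to_read automatically. The name load_padded and its return values are hypothetical; it assumes every file contains the requested columns and at least one data row, and it takes the place of the custom size_data step by measuring each file after loading it.

```python
import numpy as np

def load_padded(file_paths, row_skip, col_to_read, delim):
    """Load the chosen columns of every file into one zero-padded array.

    Returns (data, sizes), where data has shape
    (n_files, n_columns, max_rows) and sizes[k] is the number of rows
    actually read from file k. Shorter files are padded with zeros.
    """
    loaded = []
    for path in file_paths:
        # unpack=True returns one row per column; ndmin=2 keeps the
        # result two-dimensional even when a single column is requested
        cols = np.loadtxt(path, delimiter=delim, skiprows=row_skip,
                          usecols=col_to_read, unpack=True, ndmin=2)
        loaded.append(cols)

    sizes = [cols.shape[1] for cols in loaded]
    data = np.zeros((len(loaded), len(col_to_read), max(sizes)))
    for k, cols in enumerate(loaded):
        # slice the destination so that shorter files broadcast cleanly
        data[k, :, :sizes[k]] = cols
    return data, sizes
```

With col_to_read = (0,1), data[:, 0] would play the role of time and data[:, 1] the role of CH1; adding a third column only requires changing col_to_read.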

First idea

I already found a way to dynamically create a given number of empty entries, one per file, using a dictionary (see here). That's quite direct.

dicty = {}
for i in file_path:
    dicty[i] = []

This seems good, but now I would like the last line of my loading loop (the loadtxt call) to work whatever the number of variables. I believe there is a convenient way to adapt my code and use this dicty, but there is something I don't understand and I'm stuck.

I would appreciate any help.

Aldehyde

1 Answer

Well, I found a solution to this problem, which had been on my mind for a few weeks. Asking it here surely helped me make the problem clearer.

I learned more about dictionaries, which were new to me, and I understood how powerful they are. I could replace the whole code with a few lines:

from numpy import loadtxt

def import_data(file_path,row_skip,col_to_read,delim):

    # file_path is the list of PATH strings of the CSV files
    # row_skip is the number of rows to skip before loading
    # col_to_read = (0,1,2), where I choose the columns to read
    # delim = ';' the delimiter for my CSV files

    dicty = {}                       # create dictionary
    for i in file_path:              # in order to associate each file
        dicty[i] = []                # with n columns

    for k in range(0, len(file_path)):
        dicty[file_path[k]] = loadtxt(file_path[k], delimiter=delim,
                                      skiprows=row_skip, usecols=col_to_read,
                                      unpack=True)

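    # it gives one entry per file; with several columns and unpack=True,
    # each value is a 2D array with one row per column read:
    # dicty = {'my_file1.csv': array([[...], [...], [...]]),
    #          'my_file2.csv': array([[...], [...], [...]]),
    #          'my_file3.csv': array([[...], [...], [...]])}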

    return dicty

This is quite straightforward. Each entry of the dictionary is filled with all the columns read from one file, and I don't need to tell the dictionary how many columns I will give it. Then, to read the data, I use dicty.get(file_path[0]), for example. This may not be optimal, but I can surely create variables in a for loop to avoid calling dicty.get() every time.
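As an illustration of getting at the columns without repeated dicty.get() calls, here is a minimal sketch. The helper split_columns is hypothetical, the small dictionary is made-up stand-in data in the layout that import_data returns, and treating the first column as the time axis is an assumption taken from the time/CH1 naming in the question.

```python
import numpy as np

# Stand-in for the dictionary returned by import_data: two hypothetical
# files, each with 3 columns stored as rows (the layout unpack=True gives).
dicty = {
    "my_file1.csv": np.array([[0.0, 1.0, 2.0],
                              [1.0, 2.0, 3.0],
                              [5.0, 6.0, 7.0]]),
    "my_file2.csv": np.array([[0.0, 1.0],
                              [2.0, 4.0],
                              [8.0, 9.0]]),
}

def split_columns(dicty):
    """Return {path: (time, channels)} for any number of columns.

    Assumes the first column read is the time axis; every remaining
    column is collected into the list `channels`.
    """
    result = {}
    for path, cols in dicty.items():
        time, *channels = cols   # iterating a 2D array yields its rows
        result[path] = (time, channels)
    return result

split = split_columns(dicty)
time1, channels1 = split["my_file1.csv"]
```

Here channels1 is a list whose length follows the number of columns read, so nothing needs to change when col_to_read grows.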

Tell me what you think about it, especially about computation time. Sometimes I have 20 files with 200,000 rows and 3 columns each, so maybe the loading could be optimized.
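Before optimizing, it may help to measure how long the loading actually takes. The following is only a sketch: time_loader and load_all are hypothetical names, and the default parameters of load_all are illustrative values that need to match the real files.

```python
from time import perf_counter

import numpy as np

def time_loader(loader, paths, repeats=3):
    """Return the best wall-clock time in seconds of loader(paths)."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        loader(paths)
        best = min(best, perf_counter() - start)
    return best

def load_all(paths, row_skip=5, col_to_read=(0, 1, 2), delim=";"):
    """Load every file with loadtxt, as import_data does."""
    return {p: np.loadtxt(p, delimiter=delim, skiprows=row_skip,
                          usecols=col_to_read, unpack=True)
            for p in paths}
```

Calling time_loader(load_all, file_path) gives a baseline to compare any alternative loader against on the same files.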

Aldehyde