Iterate through 200 datasets

Question

I have 200 datasets and want to iterate through them, pick random rows from each, and add those rows to another (initially empty) dataset using iloc and .values. When I execute the code it does not raise an error, but it also does not add anything to the empty dataset. However, when I run the single command on its own to check whether the random row has a value, it fails with: AttributeError: 'str' object has no attribute 'iloc'.

my code is given below:

Tdata = np.zeros([20, 6])
k = 0
for j in range(200):    
        for j1 in range(0, 20):
           Tdata[k:k+1,:] = (('dataset'+j)).iloc[random.randint(100)].values
           k += 1

('dataset'+j) is basically selecting different datasets. The names of my datasets are dataset0, dataset1, dataset2, ... and they are already defined.

Any reason this needs to be done with `iloc` and `randint` instead of with [`DataFrame.sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) -> `pd.concat([df.sample(n=20) for df in list_of_dfs])` — Henry Ecker, Sep 21 '21 at 03:58
If you'd like help debugging your existing code would you define the undefined elements in your program like `Tdata`, `k`? Also can you outline what you think `(('dataset'+j))` is doing here. Are you trying to access some defined variable? — Henry Ecker, Sep 21 '21 at 04:00
@Henry I see this kind of question every few days... They want many dataframes as variables, just to do something programmatically. @ TariqShah do *NOT* use one variable per dataset, use a *container*, or loop/read your data without keeping a copy — mozway, Sep 21 '21 at 04:09
Does this answer your question? [Creating multiple dataframes with a loop](https://stackoverflow.com/questions/48888001/creating-multiple-dataframes-with-a-loop) you can combine this with `sample` — mozway, Sep 21 '21 at 04:13
@HenryEcker actually I want to get random rows from the dataset and add them to Tdata. i will try what you have suggested — TariqShah, Sep 21 '21 at 06:40
@HenryEcker ('dataset'+j) is basically selecting different datasets. The names of my datasets are dataset0, dataset1, dataset2......there are already defined. — TariqShah, Sep 21 '21 at 06:43
@mozway thanks for your suggestions, I will check the link you shared. — TariqShah, Sep 21 '21 at 06:44
there are multiple issues with your code. I have tried to summarize them. Do check and let me know if this was useful. — Akshay Sehgal, Sep 21 '21 at 07:01

Akshay Sehgal · Accepted Answer · 2021-09-21T07:13:25.077

There are multiple issues with your code.

1. Using `str` in place of the actual DataFrame variable

You are calling .iloc on a string such as 'dataset1', not on the DataFrame of that name. This cannot work because a str has no attribute .iloc, which is exactly what the error message says.

Since you want to work with DataFrame variable names, you may need to use eval() to interpret the string as a variable name. NOTE: BE EXTRA CAREFUL while using eval(). Please read the dangers of using eval() carefully.
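As an alternative to eval(), the frames can live in a container and be looked up by key. The following is only a minimal sketch, assuming the data can be placed in a dict when it is created; the dict name `datasets` and the random stand-in data are illustrative and not part of the original code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 4 frames of 100 rows x 3 columns,
# keyed by the same names the question uses (dataset0, dataset1, ...).
datasets = {f"dataset{i}": pd.DataFrame(rng.random((100, 3))) for i in range(4)}

# Look each frame up by its string key; no eval() is involved.
first_rows = [datasets[f"dataset{i}"].iloc[0] for i in range(4)]

print(len(first_rows))      # 4
print(first_rows[0].shape)  # (3,)
```

A plain list indexed by position would work just as well if the names themselves are not needed.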

2. Sampling 20 rows from each DataFrame.

If you are trying to get 20 rows by using for j1 in range(0, 20): along with random.randint(100), there is a better way that avoids this iteration. Use np.random.randint(0, 100, (n,)) to get n random indexes at once; in this case, np.random.randint(0, 100, (20,)). Note that this draws with replacement, so the same row can be picked more than once.

Or, even better, simply use df.sample(20) to sample 20 rows from a given dataframe.
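The difference between the two approaches can be seen in a small sketch. The frame `df` and the `random_state` value are made up for illustration. The point is that `np.random.randint` draws positions with replacement, so rows may repeat, while `DataFrame.sample` draws without replacement by default:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 100 rows and 3 columns.
df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

# 20 random positions in [0, 100); duplicates are possible.
positions = np.random.randint(0, 100, (20,))
picked_by_position = df.iloc[positions]

# 20 distinct rows; random_state only makes the draw repeatable.
picked_by_sample = df.sample(n=20, random_state=0)

print(picked_by_position.shape)          # (20, 3)
print(picked_by_sample.index.is_unique)  # True
```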

3. Forcing update over views of the dataframe

It's better to use a different approach than forcing an update over a view with Tdata[k:k+1,:] = .... Note also that Tdata is allocated with only 20 rows, while the loops try to write 200 × 20 = 4000 rows. Since you want to combine dataframes, it's better to collect them in a list and pass that list to pd.concat.

Here is sample code with a simple setting which should help guide you to what you are looking for.

import pandas as pd
import numpy as np

dataset0 = pd.DataFrame(np.random.random((100,3)))
dataset1 = pd.DataFrame(np.random.random((100,3)))
dataset2 = pd.DataFrame(np.random.random((100,3)))
dataset3 = pd.DataFrame(np.random.random((100,3)))

##Using random.randint
##samples = [eval('dataset'+str(i)).iloc[np.random.randint(0,100,(3,))] for i in range(4)] 

##Using df.sample()
samples = [eval('dataset'+str(i)).sample(3) for i in range(4)]

##Change - 
##1. The 3 to 20 for 20 samples per dataframe
##2. range(4) to range(200) to work with 200 dataframes

output = pd.concat(samples)
print(output)

           0         1         2
42  0.372626  0.445972  0.030467
20  0.376201  0.445504  0.835735
56  0.214806  0.083550  0.582863
85  0.691495  0.346022  0.619638
24  0.290397  0.202795  0.704082
16  0.112986  0.013269  0.903917
51  0.521951  0.115386  0.632143
73  0.946870  0.531085  0.437418
98  0.745897  0.718701  0.280326
56  0.679253  0.010143  0.124667
4   0.028559  0.769682  0.737377
84  0.857553  0.866464  0.827472

4. Storing 200 dataframes??

Last but not least, you should ask yourself why you are storing 200 dataframes as individual variables, only to sample some rows from each.

Why not try to:

- Read each of the files iteratively
- Sample rows from each
- Store them in a list of dataframes
- pd.concat once you are done iterating over the 200 files

... instead of saving 200 dataframes and then doing the same.
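The steps above might look like the following sketch. It assumes the datasets are stored as CSV files; the file names (dataset0.csv, dataset1.csv, ...) and the temporary folder are hypothetical, and the script first writes its own small files so that it runs as-is:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_files, n_rows, n_cols, n_samples = 5, 100, 6, 20

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Setup only: write hypothetical CSV files to stand in for the real data.
    for i in range(n_files):
        pd.DataFrame(rng.random((n_rows, n_cols))).to_csv(
            folder / f"dataset{i}.csv", index=False
        )

    samples = []
    for i in range(n_files):
        df = pd.read_csv(folder / f"dataset{i}.csv")  # read one file at a time
        samples.append(df.sample(n=n_samples))        # sample rows and collect them

# Concatenate once, after the loop has finished.
Tdata = pd.concat(samples, ignore_index=True)
print(Tdata.shape)  # (100, 6)
```

With 200 files and 20 samples each, Tdata would end up with 4000 rows. If a NumPy array is needed afterwards, Tdata.to_numpy() converts the result.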

Wow, amazing. Thanks for your help. I will consider all your four points. Thanks again. — TariqShah, Sep 21 '21 at 07:03
glad to help. please feel free to mark the answer if it helped. — Akshay Sehgal, Sep 21 '21 at 07:06