I have a large CSV and would like to split it into, e.g., 4 parts with names generated in the loop, e.g. sub0, sub1, sub2, sub3. I can split it routinely as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(20, 3)), columns=list('ABC'))

for i, chunk in enumerate(np.array_split(df, 4)):
    print(chunk.head(2))  # just to check
    print(chunk.tail(1))  # just to check

    sub+str(i)=chunk.copy()  # this gives an error

But when assigning names in the last line, I get the expected error: SyntaxError: can't assign to operator.

Q: How do I get sub0, ..., sub3 by copying each chunk in the loop? Thank you!

physiker
  • Possible duplicate of [Python Pandas Dynamically Create a Dataframe](https://stackoverflow.com/questions/47109931/python-pandas-dynamically-create-a-dataframe) – yatu Mar 05 '19 at 11:13
  • Best to create a dict with the names as keys: `chunks = {f'sub{i}': chunk for i, chunk in enumerate(np.array_split(df, 10))}` – Chris Adams Mar 05 '19 at 11:18
  • What is the expected output? 10 separate DataFrames? Adding the expected output to the question would make it a bit easier to answer. – John Sloper Mar 05 '19 at 11:41
  • @ChrisA could you check my edit please? I can't get the output with your line, even though I know it is almost there. – physiker Mar 05 '19 at 12:42

2 Answers

Why would you want to create variables in a loop?

  • They are unnecessary: You can store everything in lists or any other type of collection
  • They are hard to create and reuse: You have to use exec or globals()

Using a list is much easier:

subs = []
for chunk in np.array_split(df, 10):
    print(chunk.head(2))  # just to check
    print(chunk.tail(1))  # just to check
    subs.append(chunk.copy())
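
Each element of such a list is still an ordinary DataFrame, so it can be handed to any function that expects one. A minimal sketch, assuming 4 parts as in the question and a hypothetical `summarize` helper standing in for the downstream processing; it splits by row positions with `df.iloc`, which is a swapped-in variant of calling `np.array_split` on the frame directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(20, 3)), columns=list('ABC'))

# Split the row positions into 4 nearly equal groups, then slice the frame.
positions = np.array_split(np.arange(len(df)), 4)
subs = [df.iloc[idx].copy() for idx in positions]

def summarize(frame):
    """Hypothetical stand-in for a processing function that needs a DataFrame."""
    return frame['A'].sum()

first = subs[0]                          # a regular DataFrame, all methods available
totals = [summarize(s) for s in subs]    # apply the same function to every part
```

The list index plays the role of the numeric suffix, so `subs[2]` is what `sub2` would have been.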
Albert Alonso
  • Thanks @Albert, your comments are certainly valid. However, I need DataFrames rather than a list. I agree with you that my approach is not optimal; that's why I would like to know a better solution that gives me DataFrames, because I need to use them in several other functions for processing. – physiker Mar 05 '19 at 11:30
  • You can still access the data frame inside a list. You lose none of its functionality; only the way you reference it changes: `my_list[0]`. You can even use a dictionary: `my_dict['myfirstdf']`. – Parfait Mar 05 '19 at 13:51

The best way is to create a dict with the dynamic names as keys:

sub = 'sub'  # prefix for the generated names
chunks = {f'{sub}{i}': chunk for i, chunk in enumerate(np.array_split(df, 10))}
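
The frames in such a dict are looked up by key rather than by variable name, since they live inside the dict and not as separate top-level names. A minimal sketch, assuming the literal prefix 'sub' and 4 parts as in the question; it splits by row positions with `df.iloc` instead of passing the frame itself to `np.array_split`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(20, 3)), columns=list('ABC'))

# Build the name -> DataFrame mapping from groups of row positions.
positions = np.array_split(np.arange(len(df)), 4)
chunks = {f'sub{i}': df.iloc[idx].copy() for i, idx in enumerate(positions)}

print(list(chunks))      # ['sub0', 'sub1', 'sub2', 'sub3']
sub0 = chunks['sub0']    # look up a single frame by its generated name

# Loop over every name and frame together.
for name, frame in chunks.items():
    print(name, frame.shape)
```

Binding `chunks['sub0']` to a local name such as `sub0` is only needed where a plain variable is more convenient; the dict itself can be passed around as a whole.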

If you absolutely insist on creating the frames as individual variables, then you could assign them to the globals() dictionary, but this method is NOT advised:

for i, chunk in enumerate(np.array_split(df, 10)):
    globals()['sub{}'.format(i)] = chunk  # creates sub0, sub1, ... as global names
Chris Adams
  • Thanks. How do I now access all the new DataFrames? When I check with %who DataFrame, I don't see any. Perhaps the {} in {sub} are a typo? – physiker Mar 05 '19 at 13:01
  • @physiker I've updated to use `.format` instead of f-strings in case you are using an older version of Python. – Chris Adams Mar 05 '19 at 13:15
  • Thank you, your first approach works better and I agree, it is the more correct way. – physiker Mar 05 '19 at 16:36