
Context

Hi all, I'm trying to split my dataset into 180 pieces and then run each one through a geocoder (my n is ~180,000 and the geocoder has a batch limit of 1,000). I'm pretty new to Python, but some googling led me to shuffle in sklearn.utils. It seems to do the trick, and this code does what I want (conceptually):

from sklearn.utils import shuffle

df = shuffle(addresses)
df1 = df[0:1000]
df2 = df[1000:2000]
df3 = df[2000:3000]

However, I obviously don't want to sit down and construct 180 dataframes by hand like this, so I'm looking for a way to put it in a loop. This is my basic idea:

start = 0
end = 1000
for a in range(1,180):
    print(start, end, a)
    start = start+1000
    end = end+1000

The above works fine.

Code that doesn't work

However, when I try to integrate the actual splitting into the loop (not just printing), it fails. I'm pretty sure the issue is in how I'm calling the macro a when I'm naming the dataframes. I have no idea how to solve this though.

from sklearn.utils import shuffle
df = shuffle(addresses)

start = 0
end = 1000
for a in range(1,180):
    df_str(a) = df[start:end]
    start = start+1000
    end = end+1000
C.Robin

2 Answers


Potential fix:

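df_str = dict()  # maps chunk number -> sub-dataframe
start = 0
end = 1000
for a in range(1, 181):  # 1..180 inclusive; range(1, 180) would stop at 179
    df_str[a] = df[start:end]
    start += 1000
    end += 1000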

Likely problems in the original code:

  • df_str is never defined. Define it before the loop; a dictionary is a good choice.

  • df_str(a) uses round brackets, which is function-call syntax, and Python does not allow assigning to a function call. Element access uses square brackets: df_str[a].
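
As a rough sketch of the dictionary approach, assuming a small made-up DataFrame in place of `addresses` (the names `chunk_size` and `chunks` are also just for illustration), the slice boundaries can come from `range` with a step, so the separate start/end counters are not needed:

```python
import pandas as pd

# Made-up stand-in for the real `addresses` data: 2,500 rows, not ~180,000.
addresses = pd.DataFrame({"address": [f"{i} Main St" for i in range(2500)]})

# Shuffle the rows. sample(frac=1) is used here in place of
# sklearn.utils.shuffle only to keep the sketch's dependencies down.
df = addresses.sample(frac=1, random_state=0)

chunk_size = 1000  # the geocoder's batch limit from the question

# Key 1 holds the first chunk_size rows, key 2 the next, and so on.
# The last chunk is shorter if len(df) is not a multiple of chunk_size.
chunks = {
    i + 1: df.iloc[start:start + chunk_size]
    for i, start in enumerate(range(0, len(df), chunk_size))
}

for key, chunk in chunks.items():
    print(key, len(chunk))
```

Each value, for example `chunks[1]`, is an ordinary DataFrame, so it can be passed to anything that expects one.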

Siong Thye Goh
  • Thanks for your comment Siong. Unfortunately, this hasn't resolved my issue. My understanding is that I am defining df_; I simply want to add a suffix onto the string value df_ that takes the string value of the integer `a`. This is what I have now: `start = 0` `end = 1000` `for a in range(1,180):` `b = str(a)` `print(b)` `df_[b] = df[start:end]` `start = start+1000` `end = end+1000` Yet it returns `TypeError: 'str' object does not support item assignment` – C.Robin Jan 27 '18 at 17:31
  • Your latest error occurs because you defined df_ as a string, and strings are immutable in Python. You might want to explore a structure like a dictionary. Then you can store the i-th sub-dataframe as the i-th entry. – Siong Thye Goh Jan 27 '18 at 17:38
  • Thanks again for your comment Siong. It still hasn't resolved my issue though. The dictionaries are empty after the loop. Besides, I don't think fitting this in with my other code is possible. I really need them in dataframes. Do you know a way to do that? – C.Robin Jan 27 '18 at 17:58
  • [this link](https://stackoverflow.com/a/13603268/8926905) might be of interest to you. – Siong Thye Goh Jan 27 '18 at 18:02
  • Thanks. I'll look into this more carefully. The post you link advises that I only do what I'm trying to do if I absolutely know what I'm doing; I definitely don't, but there is a tradeoff with how much time I want to invest in getting to that place before I actually solve the problem I have. Really useful though -- appreciated – C.Robin Jan 27 '18 at 18:46

You can try using the exec() function to create the named dataframes. exec() runs a string as Python code, and the format() method builds that string by substituting the variable name into '{} = data'. For example, if a=1 then '{} = data'.format("df_%d" % a) produces the string 'df_1 = data', and executing it binds the name df_1 to the current slice.

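   start = 0
   end = 1000
   df_str = dict()
   for a in range(1, 181):  # 180 chunks, numbered 1..180
       df_str[a] = df.iloc[start:end]
       data = df_str[a]
       # Builds and runs the statement "df_<a> = data", e.g. "df_1 = data"
       exec('{} = data'.format("df_%d" % a))
       start = start + 1000
       end = end + 1000
       del data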

If you want each dataframe's index to start from 0 instead, call reset_index(drop=True) on the right-hand side of the assignment, replacing the exec line in the loop with:

exec('{} = data.reset_index(drop=True)'.format("df_%d" % a))
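
To see what the exec() call is doing, here is a minimal sketch with plain lists standing in for the dataframe slices (the names `pieces` and `namespace` are made up for illustration). exec() is handed an explicit dictionary to run in, which shows that the generated variable names end up as dictionary keys:

```python
# Plain lists stand in for DataFrame slices so the sketch needs no pandas.
pieces = {1: ["a", "b"], 2: ["c", "d"]}

namespace = {}  # exec() reads and writes names in this dictionary
for a, data in pieces.items():
    namespace["data"] = data  # make `data` visible to the executed code
    statement = "{} = data".format("df_%d" % a)  # e.g. "df_1 = data"
    exec(statement, namespace)
del namespace["data"]

# The generated names are now keys of the namespace dictionary.
print(namespace["df_1"])
print(namespace["df_2"])
```

The generated names are looked up by string either way, so keeping the pieces in the dictionary and writing `pieces[1]` gives the same access without building code as text.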
hilaliya
  • An explanation of your code would help to improve the quality of your post. Please if possible, add comments or explanation. – Ruli Nov 19 '20 at 09:26