I have a data frame made up of tweets and their authors; there are 45 authors in total. I want to divide the data frame into groups of 2 authors at a time so that I can later export each group to a CSV file.

I tried the following, taken from this question (the authors are in a column named 'B' and the tweets are in a column named 'A'):

df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()

Then, to separate the authors into groups:

dgroups =[]
for i in range(0,len(authors)-1,2):
    dgroups.append(df.loc[df.B==authors[i]])
    dgroups.extend(df.loc[df.B ==authors[i+1]])

But instead it gives me sub-lists like this:

dgroups = [['A'],['B'],

       [tweet,author],

       ['A'],['B'],

       [tweet,author2]]
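The stray column names in that output are consistent with how `list.extend` treats a DataFrame: `extend` iterates over its argument, and iterating over a DataFrame yields its column labels rather than its rows. A minimal sketch of that behaviour, assuming made-up sample data (only the column names 'A' and 'B' come from the question):

```python
import pandas as pd

# Made-up sample data; only the column names 'A' (tweet) and 'B' (author)
# follow the question.
df = pd.DataFrame({
    "A": ["tweet 1", "tweet 2", "tweet 3"],
    "B": ["author1", "author2", "author1"],
})

dgroups = []
# append adds the filtered DataFrame itself as a single list element.
dgroups.append(df.loc[df.B == "author1"])
# extend iterates over the DataFrame, so it adds the column labels 'A' and 'B'.
dgroups.extend(df.loc[df.B == "author2"])

print([type(item).__name__ for item in dgroups])
# ['DataFrame', 'str', 'str']
```

So the second author's rows never reach the list; only the labels 'A' and 'B' do.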

Prior to this, I was able to divide them correctly into 45 sub-lists (one per author), based on the linked question, as follows:

for i in authors:
    groups.append(df.loc[df.B==i])

So how would I do that for 2 authors at a time, or 3, and so on?

EDIT: Based on @Jonathan Leon's answer, I tried the following. It worked, but it isn't a dynamic solution and I suspect it is inefficient, especially for n > 3:

dgroups= []
for i in range(2,len(authors)+1,2):
    tempset1=[]
    tempset2=[]
    tempset1 = df.loc[df.B==authors[i-2]]
    if(i-1 != len(authors)):
        tempset2=df.loc[df.B ==authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)

TestUser1

1 Answer

This reads the non-English text incorrectly, but the logic works: it creates a new CSV file for every two authors.

import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()

authors=df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors)+auths_in_subset, auths_in_subset):
    # print(authors[i-auths_in_subset:i])
    dft = df[df.B.isin(authors[i-auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
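For group sizes other than 2, the same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `split_by_author_chunks`, its parameters, and the sample data are made up for illustration; only the column names 'A' and 'B' come from the question.

```python
import pandas as pd

def split_by_author_chunks(df, n, author_col="B"):
    """Return a list of DataFrames, each holding the rows of up to n authors.

    Hypothetical helper, not part of pandas.
    """
    # Unique authors in order of first appearance.
    authors = df[author_col].unique().tolist()
    chunks = []
    # Step through the author list n at a time; the last slice may be shorter.
    for start in range(0, len(authors), n):
        subset = authors[start:start + n]
        chunks.append(df[df[author_col].isin(subset)])
    return chunks

# Made-up sample data: 10 tweets spread over 5 authors.
df = pd.DataFrame({
    "A": [f"tweet {i}" for i in range(10)],
    "B": [f"author{i % 5}" for i in range(10)],
})

groups = split_by_author_chunks(df, n=2)
print([sorted(g["B"].unique()) for g in groups])
# [['author0', 'author1'], ['author2', 'author3'], ['author4']]
```

Each element of `groups` can then be written out with something like `g.to_csv(f'group_{k}.csv', index=False)` inside an `enumerate` loop. With an odd number of authors, as with the 45 here, the last group simply holds the single leftover author.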
Jonathan Leon
  • Thanks for answering. This didn't work: it gave the column names in 23 different sets without the data, and it kept producing empty sets. But I learned how to iterate over the whole list with your method. It seems that appending twice in a row doesn't work, but if I create 2 temporary lists, append the first to the second, and then append the result to the list (which I called dgroups), it works as I intended. This doesn't seem efficient for n > 3, though, so I guess it is a workaround rather than a good fix. – TestUser1 May 20 '21 at 08:38
  • Can you post the dataset via a link? Or share a few rows from a few authors in this post. It's always best to work with actual data. – Jonathan Leon May 20 '21 at 14:04
  • It works; it was just referencing the index instead of B. See the updated answer. – Jonathan Leon May 21 '21 at 19:47
  • Yes, that worked. Thank you so much. I don't have 15 points yet, otherwise I would have upvoted your solution. – TestUser1 May 22 '21 at 15:23