
I am trying to obtain a list of tuples from a pandas DataFrame. I'm more used to other APIs like Apache Spark, where DataFrames have a method called collect. I searched a bit and found this approach, but the result isn't what I expected; I assume that is because the DataFrame holds aggregated data. Is there a simple way to do this?

Let me show my problem:

print(df)

#date       user            Cost       
#2016-10-01 xxxx        0.598111
#           yyyy        0.598150
#           zzzz       13.537223
#2016-10-02 xxxx        0.624247
#           yyyy        0.624302
#           zzzz       14.651441

print(df.values)

#[[  0.59811124]
# [  0.59814985]
# [ 13.53722286]
# [  0.62424731]
# [  0.62430216]
# [ 14.65144134]]

#I was expecting something like this:
[("2016-10-01", "xxxx", 0.598111), 
 ("2016-10-01", "yyyy", 0.598150), 
 ("2016-10-01", "zzzz", 13.537223),
 ("2016-10-02", "xxxx", 0.624247), 
 ("2016-10-02", "yyyy", 0.624302), 
 ("2016-10-02", "zzzz", 14.651441)]

Edit

I tried what was suggested by @Dervin, but the result was unsatisfactory.

collected = [tuple(x) for x in df.values]

collected

[(0.59811124000000004,), (0.59814985000000032,), (13.53722285999994,),
 (0.62424731000000044,), (0.62430216000000027,), (14.651441339999931,), 
 (0.62414758000000026,), (0.62423407000000042,), (14.655454959999938,)]
Alberto Bonsanto

1 Answer


That's a hierarchical index (a MultiIndex) you have there, so first do what is described in this SO question, and then use something like [tuple(x) for x in df1.to_records(index=False)]. For example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])

In [12]: df1
Out[12]: 
          a         b         c         d
0  0.076626 -0.761338  0.150755 -0.428466
1  0.956445  0.769947 -1.433933  1.034086
2 -0.211886 -1.324807 -0.736709 -0.767971
...

In [13]: [tuple(x) for x in df1.to_records(index=False)]
Out[13]: 
[(0.076625682946709128,
  -0.76133754774190276,
  0.15075466312259322,
  -0.42846644471544015),
 (0.95644517961731257,
  0.76994677126920497,
  -1.4339326896803839,
  1.0340857719122247),
 (-0.21188555188408928,
  -1.3248066626301633,
  -0.73670886051415208,
  -0.76797061516159393),
...
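
Applied to a frame shaped like the one in the question, the same idea might look like the sketch below. The sample data is made up to mirror the question's output, and it assumes the index levels are named date and user; reset_index() is my guess at the flattening step, since it moves the index levels back into ordinary columns so they show up in each tuple.

```python
import pandas as pd

# Hypothetical data shaped like the question's frame:
# a Cost column indexed by (date, user).
index = pd.MultiIndex.from_tuples(
    [("2016-10-01", "xxxx"), ("2016-10-01", "yyyy"), ("2016-10-02", "xxxx")],
    names=["date", "user"],  # assumed level names
)
df = pd.DataFrame({"Cost": [0.598111, 0.598150, 0.624247]}, index=index)

# Move the index levels back into regular columns.
flat = df.reset_index()

# Approach from this answer: one tuple per record.
collected = [tuple(x) for x in flat.to_records(index=False)]

# Alternative: itertuples with name=None yields plain tuples directly.
collected_alt = list(flat.itertuples(index=False, name=None))

print(collected_alt)
```

Both lists should hold one (date, user, Cost) tuple per row. Calling df.values on its own only returns the Cost column, which is why the index values were missing from the earlier attempt.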
Dervin Thunk
  • Please try `[tuple(x) for x in df1.to_records(index=False)]`, but after you've grouped your replies like in the other SO answer. – Dervin Thunk Oct 07 '16 at 19:36