Simplifying conditional arrays from a text file pandas python

Question

I'm trying to read data from a text file and apply things such as normality tests, confidence intervals, ANOVA tests and so forth.

Is there an easier way of creating conditional arrays from my data using pandas, without manually typing out 36 lines of code as I have done below?

Later I will need to access the different flavours within these packs, so otherwise I would have to repeat this block about 7 times.

import numpy as np
import pandas as pd
import scipy.stats as st

revels_data = pd.read_csv("revels2.txt")
rd = revels_data

# packet sums
total_1 = (rd.loc[rd["Packet number"] == 1, "Contents"].sum())
total_2 = (rd.loc[rd["Packet number"] == 2, "Contents"].sum())
total_3 = (rd.loc[rd["Packet number"] == 3, "Contents"].sum())
total_4 = (rd.loc[rd["Packet number"] == 4, "Contents"].sum())
total_5 = (rd.loc[rd["Packet number"] == 5, "Contents"].sum())
total_6 = (rd.loc[rd["Packet number"] == 6, "Contents"].sum())
total_7 = (rd.loc[rd["Packet number"] == 7, "Contents"].sum())
total_8 = (rd.loc[rd["Packet number"] == 8, "Contents"].sum())
total_9 = (rd.loc[rd["Packet number"] == 9, "Contents"].sum())
total_10 = (rd.loc[rd["Packet number"] == 10, "Contents"].sum())
total_11 = (rd.loc[rd["Packet number"] == 11, "Contents"].sum())
total_12 = (rd.loc[rd["Packet number"] == 12, "Contents"].sum())
total_13 = (rd.loc[rd["Packet number"] == 13, "Contents"].sum())
total_14 = (rd.loc[rd["Packet number"] == 14, "Contents"].sum())
total_15 = (rd.loc[rd["Packet number"] == 15, "Contents"].sum())
total_16 = (rd.loc[rd["Packet number"] == 16, "Contents"].sum())
total_17 = (rd.loc[rd["Packet number"] == 17, "Contents"].sum())
total_18 = (rd.loc[rd["Packet number"] == 18, "Contents"].sum())
total_19 = (rd.loc[rd["Packet number"] == 19, "Contents"].sum())
total_20 = (rd.loc[rd["Packet number"] == 20, "Contents"].sum())
total_21 = (rd.loc[rd["Packet number"] == 21, "Contents"].sum())
total_22 = (rd.loc[rd["Packet number"] == 22, "Contents"].sum())
total_23 = (rd.loc[rd["Packet number"] == 23, "Contents"].sum())
total_24 = (rd.loc[rd["Packet number"] == 24, "Contents"].sum())
total_25 = (rd.loc[rd["Packet number"] == 25, "Contents"].sum())
total_26 = (rd.loc[rd["Packet number"] == 26, "Contents"].sum())
total_27 = (rd.loc[rd["Packet number"] == 27, "Contents"].sum())
total_28 = (rd.loc[rd["Packet number"] == 28, "Contents"].sum())
total_29 = (rd.loc[rd["Packet number"] == 29, "Contents"].sum())
total_30 = (rd.loc[rd["Packet number"] == 30, "Contents"].sum())
total_31 = (rd.loc[rd["Packet number"] == 31, "Contents"].sum())
total_32 = (rd.loc[rd["Packet number"] == 32, "Contents"].sum())
total_33 = (rd.loc[rd["Packet number"] == 33, "Contents"].sum())
total_34 = (rd.loc[rd["Packet number"] == 34, "Contents"].sum())
total_35 = (rd.loc[rd["Packet number"] == 35, "Contents"].sum())
total_36 = (rd.loc[rd["Packet number"] == 36, "Contents"].sum())

# create total array
a = np.array([total_1, total_2, total_3, total_4, total_5, total_6, total_7,
total_8, total_9, total_10, total_11, total_12, total_13, total_14, total_15,
total_16, total_17, total_18, total_19, total_20, total_21, total_22, total_23,
total_24, total_25, total_26, total_27, total_28, total_29, total_30, total_31,
total_32, total_33, total_34, total_35, total_36])

# mean confidence interval
print(st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a)))

Thanks!

EDIT:

dataset looks like:

Packet number,Flavour,Contents
1,orange,4
2,orange,3
3,orange,2
4,orange,4
5,orange,3
...
36,orange,3
1,toffee,4
2,toffee,3
...
1,chocolate,5
...

etc.

desired data:

For each of the flavour types I want an array/list of the contents to analyse, i.e.

for orange:

4
3
2
4
...

so that I can then apply various tests to these newly created arrays.

are you after: `a = rd.groupby('Packet number')['Contents'].sum()` or `a = rd[rd['Packet number'].between(1, 36)].groupby('Packet number')['Contents'].sum()`? — MaxU - stand with Ukraine, Apr 28 '17 at 15:03

MaxU - stand with Ukraine · Accepted Answer · 2017-04-28T17:45:44.860

IIUC you can do the following.

If you have only 36 distinct values (from 1 to 36) in the Packet number column:

a = rd.groupby('Packet number')['Contents'].sum()

If you have more and want to filter them first:

a = rd[rd['Packet number'].between(1, 36)].groupby('Packet number')['Contents'].sum()
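As a minimal sketch of how the grouped result can replace the 36 hand-written totals and feed the confidence interval from the question: the inline sample below is hypothetical (invented values in the same column layout as the question's file), and `totals`, `low` and `high` are illustrative names, not from the original post.

```python
import io

import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical sample in the same layout as the question's file
# (values invented for illustration only).
sample = io.StringIO(
    "Packet number,Flavour,Contents\n"
    "1,orange,4\n2,orange,3\n3,orange,2\n"
    "1,toffee,4\n2,toffee,3\n3,toffee,5\n"
)
rd = pd.read_csv(sample)

# One total per packet, indexed by packet number.
totals = rd.groupby("Packet number")["Contents"].sum()

# Plain NumPy array, playing the role of the hand-built `a` in the question.
a = totals.to_numpy()

# 95% confidence interval for the mean packet total.
low, high = st.t.interval(0.95, len(a) - 1, loc=np.mean(a), scale=st.sem(a))
print(totals)
print(low, high)
```

The grouped Series keeps the packet number as its index, so an individual total is still available as `totals[1]`, `totals[2]` and so on if it is needed later.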

UPDATE:

Source DF

In [233]: df
Out[233]:
   Packet number    Flavour  Contents
0              1     orange         4
1              2     orange         3
2              3     orange         2
3              4     orange         4
4              5     orange         3
5             36     orange         3
6              1     toffee         4
7              2     toffee         3
8              1  chocolate         5

simple boolean indexing

In [234]: df.loc[df.Flavour == 'orange', 'Contents']
Out[234]:
0    4
1    3
2    2
3    4
4    3
5    3
Name: Contents, dtype: int64

... plus sum

In [235]: df.loc[df.Flavour == 'orange', 'Contents'].sum()
Out[235]: 19

filter, groupby, aggregate

In [237]: df.loc[df.Flavour.isin(['orange','toffee'])].groupby('Flavour')['Contents'].sum()
Out[237]:
Flavour
orange    19
toffee     7
Name: Contents, dtype: int64
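If the goal is one array per flavour rather than a sum, iterating over the groupby object gives each flavour's values directly. The following is only a sketch under assumptions: the sample values are invented, `by_flavour` is an illustrative name, and the one-way ANOVA via `scipy.stats.f_oneway` is just one example of a test that could be applied to the arrays.

```python
import io

import pandas as pd
import scipy.stats as st

# Hypothetical sample in the same layout as the question's data
# (values invented for illustration only).
sample = io.StringIO(
    "Packet number,Flavour,Contents\n"
    "1,orange,4\n2,orange,3\n3,orange,2\n4,orange,4\n"
    "1,toffee,4\n2,toffee,3\n3,toffee,5\n4,toffee,4\n"
    "1,chocolate,5\n2,chocolate,6\n3,chocolate,5\n4,chocolate,7\n"
)
df = pd.read_csv(sample)

# Map each flavour name to a NumPy array of its Contents values.
by_flavour = {
    flavour: contents.to_numpy()
    for flavour, contents in df.groupby("Flavour")["Contents"]
}

print(by_flavour["orange"])

# Example analysis: one-way ANOVA across all flavours at once.
f_stat, p_value = st.f_oneway(*by_flavour.values())
print(f_stat, p_value)
```

Note that `groupby` expects a column name (or a list of them), not a condition. An expression such as `'Flavour' == 'orange'` compares two strings and evaluates to `False` before pandas ever sees it, which is a likely source of the `KeyError: False` mentioned in the comments.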

edited Apr 28 '17 at 17:45

answered Apr 28 '17 at 15:19

MaxU - stand with Ukraine


The second problem occurs when I want to sort by flavour. Say I have the flavours orange, chocolate, etc.; I then want an array of all the values where flavour = "orange". I tried orange = rd.groupby('Flavour' = 'orange')['Contents'].sum() but I get a KeyError: False. – pow Apr 28 '17 at 15:52
Any idea for this? ^^ – pow Apr 28 '17 at 16:51
@mystifier, could you post a small reproducible data set and the desired data set? This will help to understand what your data looks like and what you are trying to achieve. – MaxU - stand with Ukraine Apr 28 '17 at 17:11
@mystifier, very good, thanks! Could you also add your desired data set? – MaxU - stand with Ukraine Apr 28 '17 at 17:21
@mystifier, I've updated my post - is that what you want? – MaxU - stand with Ukraine Apr 28 '17 at 17:41
@mystifier, glad I could help :-) Please don't forget to include sample and desired data sets when asking Pandas/Numpy/Scipy/SKLearn/etc. questions - this dramatically increases the chance of getting a good, fast and __tested__ answer ;-) Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – MaxU - stand with Ukraine Apr 28 '17 at 17:53
