Extracting subset of training data based on label

Question

I am given training data and their corresponding labels (integers 1,2,...,9) in two text files. Both text files are sequences of numbers.

The first 500 numbers in the training set correspond to the first data point, the second 500 numbers correspond to the second data point, etc.

I want to extract the subset of training points which have label 2 or label 3. My implementation of this is extremely slow:

import numpy as np

ytrain_old = np.genfromtxt('TrainLabels.txt')
Xtrain_old = np.genfromtxt('Train.txt')

Xtrain = []
ytrain = []

for i in range(10000):
    if (ytrain_old[i]==2) or (ytrain_old[i]==3):
        ytrain.append(ytrain_old[i])
        Xtrain.append([Xtrain_old[i*500:(i+1)*500]])

What would be a better way to do this? I would prefer to have it as a pandas dataframe actually.

Can you explain what (and why) are you doing in `Xtrain[i*700:(i+1)*700]`? — MaxU - stand with Ukraine, Nov 21 '17 at 12:57
Oh, that should be Xtrain_old rather than Xtrain. What I am trying to do is: for each label which is either 2 or 3 I want to access the corresponding test data (i.e. the corresponding 500 numbers) @MaxU — denmarksucks, Nov 21 '17 at 13:11
Can you add the labels? You can do that with a simple groupby, i.e. `ndf = pd.concat([Xtrain_old, ytrain_old], axis=1)`, then `train = ndf.groupby('y_train_column_header').head(500)`, then boolean indexing: `train = train[train['y_train_column_header'].isin([2,3])]`. Later you can split them into y_train and x_train. — Bharath M Shetty, Nov 21 '17 at 13:37
Thank you @Bharath. How must I load the .txt files to make the concatenation work? Using np.genfromtxt does not work. — denmarksucks, Nov 21 '17 at 14:32
You can use `pd.read_csv()`; that would be much better. This might help: https://stackoverflow.com/questions/21546739/load-data-from-txt-with-pandas — Bharath M Shetty, Nov 21 '17 at 14:35

score 0 · Answer 1 · answered Nov 21 '17 at 13:49

First of all, I would merge xtrain and ytrain into one frame. For that, we need to pivot your x frame so that each data point becomes one row:

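import numpy as np
import pandas as pd

# Dummy data: 20 data points of 500 values each, stored as one long column
xtrain_old = pd.Series(np.random.random(10000)).to_frame()
ytrain_old = pd.Series(np.random.randint(5, size=20))

# Name each value by its position within its data point (0..499)
xtrain_old['column_names'] = 'feature_' + (xtrain_old.index % 500).astype(str)
# Re-index so that all 500 values of a data point share one row index
xtrain_old.index = xtrain_old.index // 500
# Pivot from long to wide: one row per data point, one column per feature
xtrain_old = xtrain_old.pivot(columns='column_names')
xtrain_old.columns = xtrain_old.columns.droplevel()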

Now we can merge the label:

ytrain_old = ytrain_old.rename('label')
df = pd.concat([xtrain_old, ytrain_old], axis=1)

And select all rows with the label we care about:

df_selected = df.loc[df['label'].isin([2,3])]
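Once the rows are selected, the frame can be split back into features and labels, as suggested in the comments. This is a small sketch with a toy frame standing in for `df`; the names `X_selected` and `y_selected` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame: 4 data points, 3 features, one label column
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['feature_0', 'feature_1', 'feature_2'])
df['label'] = [1, 2, 3, 2]

# Keep only rows whose label is 2 or 3
df_selected = df.loc[df['label'].isin([2, 3])]

# Split back into a feature frame and a label series
X_selected = df_selected.drop(columns='label')
y_selected = df_selected['label']
```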

score 0 · Accepted Answer · answered Nov 21 '17 at 14:11

What about:

sel = np.logical_or(ytrain_old == 2, ytrain_old == 3)
Xtrain = Xtrain_old.reshape((-1,500))[sel]
ytrain = ytrain_old[sel]
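Since the question asks for a pandas DataFrame, the masked arrays can be wrapped in one afterwards. The following is a minimal sketch that uses synthetic data in place of `Train.txt` and `TrainLabels.txt`; the column names `feature_0`, ..., `feature_499` and `label` are assumptions chosen for illustration, and `np.isin` is used as an equivalent of the `np.logical_or` mask.

```python
import numpy as np
import pandas as pd

n_points, n_features = 20, 500  # 500 values per data point, as in the question

# Synthetic stand-ins for np.genfromtxt('Train.txt') and
# np.genfromtxt('TrainLabels.txt'); the real files are not available here
rng = np.random.default_rng(0)
Xtrain_old = rng.random(n_points * n_features)
ytrain_old = rng.integers(1, 10, size=n_points).astype(float)

# One row per data point; -1 lets NumPy infer the number of rows
X2d = Xtrain_old.reshape(-1, n_features)

# Boolean mask: True where the label is 2 or 3
sel = np.isin(ytrain_old, [2, 3])

# Wrap the selected rows in a DataFrame and attach the labels
df = pd.DataFrame(X2d[sel],
                  columns=[f'feature_{j}' for j in range(n_features)])
df['label'] = ytrain_old[sel].astype(int)
```

The reshape and the mask are both vectorized, so no Python-level loop over the data points is needed.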

answered Nov 21 '17 at 14:11

lukas
