I know there is another post about fitting across multiple rows in pandas, but that method isn't the one I am looking for.

My problem:

I want to fit an encoder on the data in all rows of dataset A. Dataset A has 4 rows, and each row holds different data. I want to fit on the data from all 4 rows first and then transform each row.

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

pda = pd.DataFrame({"input":pd.Series(["abc23d,efgh45,jklfj4","dfer56,efgh45,jklh45","abc23d,efgh66,jklfj7","abc23d,efgh45,jklfj4"]),
                   "label": pd.Series([1,2,3,1])})

label_encoder = LabelEncoder()
pda["encoded_input"] = pda["input"].apply(lambda x:x.split(",")).apply(label_encoder.fit_transform)

Current result (this is wrong, because it fits and transforms each row separately, which resets the fitted vocabulary on every row. I want to fit on the data from all rows first; we have more than 5 distinct values, so some codes should be above 5. I tried combining the data from all rows into one list and fitting on that, but this is too expensive. I would like to know a better, smarter way that reduces the cost):

                  input  label encoded_input
0  abc23d,efgh45,jklfj4      1     [0, 1, 2]
1  dfer56,efgh45,jklh45      2     [0, 1, 2]
2  abc23d,efgh66,jklfj7      3     [0, 1, 2]
3  abc23d,efgh45,jklfj4      1     [0, 1, 2]

Expected result (each distinct value gets a unique number across all rows, assigned when the rows are transformed):

                  input  label encoded_input
0  abc23d,efgh45,jklfj4      1     [0, 1, 2]
1  dfer56,efgh45,jklh45      2     [0, 1, 2]
2  abc23d,efgh66,jklfj7      3     [0, 1, 2]
3  abc23d,efgh45,jklfj4      1     [0, 1, 2]
Anonymous

1 Answer

I would use:

pda['ecode']=pda.input.str.split(',',expand=True).T.rank().T.values.tolist()
pda
                  input  label            ecode
0  abc23d,efgh45,jklfj4      1  [1.0, 2.0, 3.0]
1  dfer56,efgh45,jklh45      2  [1.0, 2.0, 3.0]
2  abc23d,efgh66,jklfj7      3  [1.0, 2.0, 3.0]
3  abc23d,efgh45,jklfj4      1  [1.0, 2.0, 3.0]

Update

pda['ecode']=pda.input.str.split(',').explode().astype('category').cat.codes.groupby(level=0).apply(list)
pda
                  input  label      ecode
0  abc23d,efgh45,jklfj4      1  [0, 2, 4]
1  dfer56,efgh45,jklh45      2  [1, 2, 6]
2  abc23d,efgh66,jklfj7      3  [0, 3, 5]
3  abc23d,efgh45,jklfj4      1  [0, 2, 4]
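
To make the one-liner in the update easier to follow, here is a step-by-step sketch of the same chain. The intermediate names (`tokens`, `exploded`, `categories`, `codes`) are only for illustration and are not part of the original code; `Series.explode` assumes pandas 0.25.0 or later.

```python
import pandas as pd

pda = pd.DataFrame({
    "input": ["abc23d,efgh45,jklfj4", "dfer56,efgh45,jklh45",
              "abc23d,efgh66,jklfj7", "abc23d,efgh45,jklfj4"],
    "label": [1, 2, 3, 1],
})

# 1. Split each string into a list of tokens.
tokens = pda["input"].str.split(",")

# 2. Put one token on each row; the original row index is repeated
#    for every token that came from that row.
exploded = tokens.explode()

# 3. Convert to a categorical. The categories are the sorted unique
#    tokens over ALL rows, so this is the single shared vocabulary.
categories = exploded.astype("category")

# 4. Replace each token by the integer position of its category.
codes = categories.cat.codes

# 5. Collect the codes back into one list per original row.
pda["ecode"] = codes.groupby(level=0).apply(list)
print(pda)
```

Because the vocabulary is built once in step 3, the same token gets the same code in every row, which is what fitting on all rows and then transforming would give.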
BENY
  • Thank you for your response, but please read the note under my current result. I also stated that I want each value in ecode to have a unique number. The ecode you generated fits and transforms each row separately, so it does not capture the vocabulary of the other rows. – Anonymous Aug 15 '19 at 00:31
  • Thank you, but can you explain your code? I don't understand why it works even without using LabelEncoder. Could you explain each function in there before I mark it as the answer? – Anonymous Aug 15 '19 at 00:35
  • @Anonymous The pandas equivalent of LabelEncoder is categorical data with its codes; you can use the pandas functions `pd.factorize` or `astype('category').cat.codes` instead of calling sklearn. – BENY Aug 15 '19 at 00:37
  • @Anonymous https://stackoverflow.com/questions/42196589/any-way-to-get-mappings-of-a-label-encoder-in-python-pandas – BENY Aug 15 '19 at 00:37
  • I tried to run your code and it said Series has no explode method. How are you able to run it? – Anonymous Aug 15 '19 at 00:38
  • @Anonymous Update your pandas to 0.25.0, or see the function in https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/53218939#53218939 – BENY Aug 15 '19 at 00:39
  • Thank you for explaining. Let me check it and then I will mark it as the answer. – Anonymous Aug 15 '19 at 00:42
  • Thank you, sir, for helping me learn new things and for the additional examples supporting your answer. I appreciate it and will mark it as the answer. – Anonymous Aug 15 '19 at 00:52
  • Can I also ask what the level argument in the groupby function is for? – Anonymous Aug 15 '19 at 00:57
  • @Anonymous It means `groupby` on the `index`. – BENY Aug 15 '19 at 01:02
  • Do you know why I can't fit it into logistic regression? X = pda["input"] y = pda["label"] X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) – Anonymous Aug 15 '19 at 02:55
  • from sklearn.linear_model import LogisticRegression logreg = LogisticRegression() logreg.fit(X_train.to_numpy(), y_train.to_numpy()) print('Accuracy of Logistic regression classifier on training set: {:.2f}' .format(logreg.score(X_train.to_numpy(), y_train.to_numpy()))) print('Accuracy of Logistic regression classifier on test set: {:.2f}' .format(logreg.score(X_test.to_numpy(), y_test.to_numpy()))) – Anonymous Aug 15 '19 at 02:55
  • After running the two snippets above, I received an error saying "setting an array element with a sequence." – Anonymous Aug 15 '19 at 02:56
  • @Anonymous what is input ? – BENY Aug 15 '19 at 03:20
  • My mistake, the input is actually "ecode" in our case. – Anonymous Aug 15 '19 at 03:22
  • @Anonymous You need to use X = pd.DataFrame(pda.ecode.tolist(), index=pda.index) – BENY Aug 15 '19 at 03:24
  • Thank you so much, sir. But is there a way to create a dictionary that holds all these labels, for example what the unique number for "abc23d" is? I want a dictionary that maps text to unique ID and vice versa. I am new to this method and hope you can guide me. – Anonymous Aug 16 '19 at 02:57
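
The last two follow-ups in the comments (a 2-D feature table for the classifier, and a text-to-ID dictionary) could be sketched as below. This is only one possible approach, not something confirmed in the thread: the names `id_to_token` and `token_to_id` are made up for illustration, and the expansion into `X` assumes every row has the same number of tokens (rows with fewer tokens would be padded with NaN).

```python
import pandas as pd

pda = pd.DataFrame({
    "input": ["abc23d,efgh45,jklfj4", "dfer56,efgh45,jklh45",
              "abc23d,efgh66,jklfj7", "abc23d,efgh45,jklfj4"],
    "label": [1, 2, 3, 1],
})

# Same encoding as in the answer's update, keeping the categorical around
# so that its vocabulary can be read back afterwards.
exploded = pda["input"].str.split(",").explode().astype("category")
pda["ecode"] = exploded.cat.codes.groupby(level=0).apply(list)

# The code of a token is its position in `cat.categories`, so enumerating
# the categories gives the ID -> token mapping; inverting it gives token -> ID.
id_to_token = dict(enumerate(exploded.cat.categories))
token_to_id = {token: idx for idx, token in id_to_token.items()}

# Expand the list column into one numeric column per token position,
# so that an estimator receives a 2-D table instead of a column of lists.
X = pd.DataFrame(pda["ecode"].tolist(), index=pda.index)
y = pda["label"]

print(token_to_id)
print(X)
```

Passing a column whose cells are lists is what triggers the "setting an array element with a sequence" error; `X` above has plain integer cells instead.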