I know there is another post about fitting across multiple rows in pandas, but that method isn't the one I am looking for.
My problem:
I want to fit the encoder on all the data in the rows of dataset A. Dataset A has 4 rows, and each row holds different data. I want to fit on the data from all 4 rows first and only then transform it.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
pda = pd.DataFrame({"input":pd.Series(["abc23d,efgh45,jklfj4","dfer56,efgh45,jklh45","abc23d,efgh66,jklfj7","abc23d,efgh45,jklfj4"]),
"label": pd.Series([1,2,3,1])})
label_encoder = LabelEncoder()
pda["encoded_input"] = pda["input"].apply(lambda x:x.split(",")).apply(label_encoder.fit_transform)
Current Result: (this is wrong, because it fits and transforms each row at the same time. I do not want to fit and transform one row at a time, because that keeps resetting the fitted vocabulary. I want to fit on the data from all rows first; there are more than 5 distinct values, so the encoding should include values above 5. I tried combining the data from all rows into one list and fitting on that, but this is too expensive. I would like to know a better, smarter way to reduce the cost.)
                  input  label encoded_input
0  abc23d,efgh45,jklfj4      1     [0, 1, 2]
1  dfer56,efgh45,jklh45      2     [0, 1, 2]
2  abc23d,efgh66,jklfj7      3     [0, 1, 2]
3  abc23d,efgh45,jklfj4      1     [0, 1, 2]
Expected Result: (each distinct value gets its own unique number across all rows, and the transform assigns those numbers to every row)
input label encoded_input
0 abc23d,efgh45,jklfj4 1 [0, 1, 2]
1 dfer56,efgh45,jklh45 2 [0, 1, 2]
2 abc23d,efgh66,jklfj7 3 [0, 1, 2]
3 abc23d,efgh45,jklfj4 1 [0, 1, 2]
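To make the goal concrete, here is a minimal sketch of "fit once on all rows, then transform", assuming `Series.str.split` plus `explode` is an acceptable way to gather the tokens. It still collects every token before fitting, only in a vectorized way, and whether that is cheap enough on a large dataset is untested. The variable names `tokens` and `codes` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

pda = pd.DataFrame({
    "input": ["abc23d,efgh45,jklfj4", "dfer56,efgh45,jklh45",
              "abc23d,efgh66,jklfj7", "abc23d,efgh45,jklfj4"],
    "label": [1, 2, 3, 1],
})

# Split each row into tokens and flatten to one token per line.
# explode() repeats the original row index, so rows can be regrouped later.
tokens = pda["input"].str.split(",").explode()

# Fit a single encoder on every token from every row (shared vocabulary).
label_encoder = LabelEncoder()
codes = pd.Series(label_encoder.fit_transform(tokens), index=tokens.index)

# Collect the codes back into one list per original row.
pda["encoded_input"] = codes.groupby(level=0).agg(list)

print(pda)
```

With this sample data the shared vocabulary has 7 entries, so the codes run from 0 to 6, and rows 0 and 3 receive identical lists because they contain the same tokens.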