I'm trying to find a more pythonic (and faster) way to iterate through a dataframe and split a string dependent on the conditions. For example I have:
df = pd.DataFrame(
[
["ab", "abcdef"],
["de", "abdghi"],
["gh", "0"],
["ij", "123-456-789"],
[kl, "afzhkp"],
["mn", "qtbkpse"],
],
columns=list("c1", "Office")
)
And my current code:
for index, row in df.iterrows():
if row["Office"] == "0" or pd.isnull(row["Office"]):
df.loc[index, "N_1"] = "Unknown"
df.loc[index, "N_2"] = "Unknown"
elif row["Office"].startswith("abc") or row["Office"].startswith("abd"):
df.loc[index, "N_1"] = row["Office"][:3]
df.loc[index, "N_2"] = row["Office"][3:]
elif row["Office"].startswith("Bu"):
df.loc[index, "N_1"] = row["Office"].split(" ")[0]
df.loc[index, "N_2"] = row["Office"].split(" ")[1]
elif "-" in row["Office"]:
df.loc[index, "N_1"] = row["Office"].split("-")[1]
df.loc[index, "N_2"] = row["Office"].split("-")[2]
else:
if len(row["Office"]) == 6:
df.loc[index, "N_1"] = row["Office"][:3]
df.loc[index, "N_2"] = row["Office"][3:]
else:
df.loc[index, "N_1"] = row["Office"][:4]
df.loc[index, "N_2"] = row["Office"][4:]
The intention would then be to have 2 new columns: N_1 and N_2 with values in each.
I'd like to condense all these if statements somehow, and to speed it up my moving away from iterrows
to something more orientated towards what pandas was designed for (so "where" etc. instead of the iterrows
)
Expected df:
df = pd.DataFrame(
[
["ab", "abcdef", "abc, "def"],
["de", "abdghi", "abd", "ghi"],
["gh", "0", "Unknown", "Unknown"],
["ij", "123-456-789", "456", "789"],
["kl", "afzhkp", "afz", "hkp"],
["mn", "qtbkpse", "gtbk", "pse"],
],
columns=list("c1", "Office", "N_1", "N_2")
)