0

I'm trying to find a more pythonic (and faster) way to iterate through a dataframe and split a string dependent on the conditions. For example I have:

df = pd.DataFrame(
    [
        ["ab", "abcdef"],
        ["de", "abdghi"],
        ["gh", "0"],
        ["ij", "123-456-789"],
        [kl, "afzhkp"],
        ["mn", "qtbkpse"],
    ],
    columns=list("c1", "Office")
)

And my current code:

    for index, row in df.iterrows():
        if row["Office"] == "0" or pd.isnull(row["Office"]):
            df.loc[index, "N_1"] = "Unknown"
            df.loc[index, "N_2"] = "Unknown"
        elif row["Office"].startswith("abc") or row["Office"].startswith("abd"):
            df.loc[index, "N_1"] = row["Office"][:3]
            df.loc[index, "N_2"] = row["Office"][3:]
        elif row["Office"].startswith("Bu"):
            df.loc[index, "N_1"] = row["Office"].split(" ")[0]
            df.loc[index, "N_2"] = row["Office"].split(" ")[1]
        elif "-" in row["Office"]:
            df.loc[index, "N_1"] = row["Office"].split("-")[1]
            df.loc[index, "N_2"] = row["Office"].split("-")[2]
        else:
            if len(row["Office"]) == 6:
                df.loc[index, "N_1"] = row["Office"][:3]
                df.loc[index, "N_2"] = row["Office"][3:]
            else:
                df.loc[index, "N_1"] = row["Office"][:4]
                df.loc[index, "N_2"] = row["Office"][4:]

The intention would then be to have 2 new columns: N_1 and N_2 with values in each.

I'd like to condense all these if statements somehow, and to speed it up my moving away from iterrows to something more orientated towards what pandas was designed for (so "where" etc. instead of the iterrows)

Expected df:

df = pd.DataFrame(
    [
        ["ab", "abcdef", "abc, "def"],
        ["de", "abdghi", "abd", "ghi"],
        ["gh", "0", "Unknown", "Unknown"],
        ["ij", "123-456-789", "456", "789"],
        ["kl", "afzhkp", "afz", "hkp"],
        ["mn", "qtbkpse", "gtbk", "pse"],
    ],
    columns=list("c1", "Office", "N_1", "N_2")
)
pymat
  • 1,090
  • 1
  • 23
  • 45

1 Answers1

0

Here's what I would do to make your code faster:

  1. Define a funcion as follows:

     def func(row):
         if row["Office"] == "0" or pd.isnull(row["Office"]):
             row["N_1"] = "Unknown"
             row["N_2"] = "Unknown"
         elif row["Office"].startswith("abc") or row["Office"].startswith("abd"):
             row["N_1"] = row["Office"][:3]
             row["N_2"] = row["Office"][3:]
         elif row["Office"].startswith("Bu"):
             row["N_1"] = row["Office"].split(" ")[0]
             row["N_2"] = row["Office"].split(" ")[1]
         elif "-" in row["Office"]:
             row["N_1"] = row["Office"].split("-")[1]
             row["N_2"] = row["Office"].split("-")[2]
         else:
             if len(row["Office"]) == 6:
                 row["N_1"] = row["Office"][:3]
                 row["N_2"] = row["Office"][3:]
             else:
                 row["N_1"] = row["Office"][:4]
                 row["N_2"] = row["Office"][4:]
         return row
    
  2. Define the DataFrame:

     df = pd.DataFrame([
              ['ab', "abcdef"],
              ['de', "abdghi"],
              ['gh', "0"],
              ['ij', "123-456-789"],
              ['kl', "afzhkp"],
              ['mn', "qtbkpse"],
          ],
          columns=["c1", "Office"])
    
  3. Use apply() method to apply the condition to every row:

     df.apply(func, axis=1)
    

This should return the following:

       c1       Office      N_1      N_2
    0  ab       abcdef      abc      def
    1  de       abdghi      abd      ghi
    2  gh            0  Unknown  Unknown
    3  ij  123-456-789      456      789
    4  kl       afzhkp      afz      hkp
    5  mn      qtbkpse     qtbk      pse

Hope this helps!