
I have a table which has internal SKUs in column 0 and then synonyms along that row. The number of synonyms is not constant (currently 0 to 7, but it will tend to grow).

I need an efficient function which will let me take SKUs from one column of a large table and translate them to the canonical SKU (the SKU0 column) of my other table.

This is my current function. It takes an array of SKUs from one table, searches for each of them in the other table, and returns the first-column value of the row where a match is found.

import sys

def new_array(dfarray1, array1, trans_dic):
    missing_values = set()
    new_array = []
    for value in array1:
        # boolean Series marking dictionary rows that contain this value in any column
        pos = trans_dic.eq(str(value)).any(axis=1)
        if len(pos[pos]) > 0:
            # take sku_0 from the first matching row
            new_array.append(trans_dic['sku_0'][pos[pos].index[0]])
        else:
            missing_values.add(str(value))
    if len(missing_values) > 0:
        print("The following values are missing in the dictionary. They are in the DF called: " + dfarray1)
        print(missing_values)
        sys.exit()
    else:
        return new_array

I'm sure that this is very badly written, because it takes my laptop about 3 minutes to go through only about 75K values. Can anyone help me make this faster?


Some questions asked previously:

What types are your function parameters? (can guess pandas, but no way to know for sure)

Yes. I am working on two pandas dataframes.

What does your table even look like?

Dictionary table:

SKU0    Synonym 0    Synonym 1    Synonym 2
foo     bar          bar1
foo1    baar1
foo2    baaar0                    baar2

Values table:

SKU      Value   Value1   value1
foo      3       1        7
baar1    4       5        7
baaar0   5       5        9

Desired table:

SKU      Value   Value1   value1
foo      3       1        7
foo1     4       5        7
foo2     5       5        9

What does the rest of your code that is calling this function look like?

df1.sku = new_array('df1', list(df1.sku), sku_dic)

mixcocam

1 Answer


Given the dictionary dataframe in the format

import numpy as np
import pandas as pd

df_dict = pd.DataFrame({
    "SKU0": ["foo", "foo1", "foo2"],
    "Synonym 0": ["bar", "baar1", "baaar0"],
    "Synonym 1": ["bar1", np.nan, np.nan],
    "Synonym 2": [np.nan, np.nan, "baar2"]
})

and a values dataframe in the format

df_values = pd.DataFrame({
    "SKU": ["foo", "baar1", "baaar0"],
    "Value": [3, 4, 5],
    "Value1": [1, 5, 5],
    "value1": [7, 7, 9]
})

you can get the output you want by first using pd.melt to restructure your dictionary dataframe and then joining it to your values dataframe. Some extra logic then checks which column to take the final value from and selects the columns needed in the output.

(
    df_dict
    # converts dict df from wide to long format
    .melt(id_vars=["SKU0"])
    # filters rows where there is no synonym
    .loc[lambda x: x["value"].notna()]
    # join dictionary with values df
    .merge(df_values, how="right", left_on="value", right_on="SKU")
    # get final value by taking the value from column "SKU0" if available, else "SKU"
    .assign(SKU = lambda x: np.where(x["SKU0"].isna(), x["SKU"], x["SKU0"]))
    # select final columns needed in output
    [["SKU", "Value", "Value1", "value1"]]
)

# output
    SKU     Value   Value1  value1
0   foo     3       1       7
1   foo1    4       5       7
2   foo2    5       5       9
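
If you want to slot this in where the original new_array call was, one option (just a sketch, not part of the answer as posted; translate_values is a name made up here) is to wrap the chain in a function and assign its result back. It reuses the imports and dataframes defined above.

def translate_values(df_values, df_dict):
    """Return df_values with synonym SKUs replaced by their canonical SKU0."""
    return (
        df_dict
        .melt(id_vars=["SKU0"])
        .loc[lambda x: x["value"].notna()]
        .merge(df_values, how="right", left_on="value", right_on="SKU")
        .assign(SKU=lambda x: np.where(x["SKU0"].isna(), x["SKU"], x["SKU0"]))
        [["SKU", "Value", "Value1", "value1"]]
    )

# replaces the original call: df1.sku = new_array('df1', list(df1.sku), sku_dic)
df_values = translate_values(df_values, df_dict)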
Oxbowerce
  • @mixcocam Even if the tables are of different lengths this approach will work; the only thing you might need to change is how the two tables are merged, depending on how you want to treat those cases. – Oxbowerce Feb 09 '21 at 17:31
  • Thank you @oxbowerce. Is there any way to easily extract the SKUs that are not in my dictionary using your method? – mixcocam Feb 13 '21 at 09:03
  • Yes, if you change the merge type to `"outer"` and add the argument `indicator=True` you can see which records are in which table (see the sketch below these comments), see also [this answer](https://stackoverflow.com/a/38242463/9435355). – Oxbowerce Feb 13 '21 at 11:04
  • Do you know if this is possible using SQL? It would be great to achieve this transformation directly using sqlite. – mixcocam Oct 28 '21 at 18:12
  • This is definitely possible using SQL. While normally you could use `UNPIVOT` to convert the data from wide to long format, SQLite doesn't have this operator so you need to use several `SELECT` statements with a `UNION ALL` in between. See [this sqlfiddle example](http://sqlfiddle.com/#!7/5193d/8). – Oxbowerce Oct 29 '21 at 08:28
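
A minimal sketch of the `indicator=True` idea from the comments above, reusing the df_dict and df_values defined in the answer; nothing here goes beyond the standard pandas merge options.

import pandas as pd  # df_dict and df_values are the dataframes built above

# An outer merge keeps unmatched rows from both sides; indicator=True adds a
# "_merge" column labelling each row "left_only", "right_only" or "both".
merged = (
    df_dict
    .melt(id_vars=["SKU0"])
    .loc[lambda x: x["value"].notna()]
    .merge(df_values, how="outer", left_on="value", right_on="SKU", indicator=True)
)

# "right_only" rows come from the values table and matched no synonym.
# Note that an SKU which is already an SKU0 (like "foo" here) also lands in
# this bucket, because only the synonym columns were melted.
print(merged.loc[merged["_merge"] == "right_only", "SKU"].tolist())  # ['foo']

And a sketch of the SQLite route described in the last comment, emulating UNPIVOT with UNION ALL. It is run through Python's sqlite3 module; the table and column names (dictionary, sku_values, synonym_0, ...) are made up for this example rather than taken from the linked sqlfiddle, and only one value column is carried along to keep it short.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Stand-in tables mirroring the sample data from the question.
cur.executescript("""
CREATE TABLE dictionary (SKU0 TEXT, synonym_0 TEXT, synonym_1 TEXT, synonym_2 TEXT);
INSERT INTO dictionary VALUES
    ('foo',  'bar',    'bar1', NULL),
    ('foo1', 'baar1',  NULL,   NULL),
    ('foo2', 'baaar0', NULL,   'baar2');

CREATE TABLE sku_values (SKU TEXT, Value INTEGER);
INSERT INTO sku_values VALUES ('foo', 3), ('baar1', 4), ('baaar0', 5);
""")

# Emulate UNPIVOT with one SELECT per synonym column glued by UNION ALL,
# then map each incoming SKU to its SKU0 (falling back to the SKU itself).
rows = cur.execute("""
WITH long_dict AS (
    SELECT SKU0, synonym_0 AS synonym FROM dictionary WHERE synonym_0 IS NOT NULL
    UNION ALL
    SELECT SKU0, synonym_1 FROM dictionary WHERE synonym_1 IS NOT NULL
    UNION ALL
    SELECT SKU0, synonym_2 FROM dictionary WHERE synonym_2 IS NOT NULL
)
SELECT COALESCE(d.SKU0, v.SKU) AS SKU, v.Value
FROM sku_values AS v
LEFT JOIN long_dict AS d ON d.synonym = v.SKU
""").fetchall()

print(rows)  # foo/3, foo1/4, foo2/5 (row order may vary)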