
Given an input like:

{'example_id': 0,
 'query': ' revent 80 cfm',
 'query_id': 0,
 'product_id': 'B000MOO21W',
 'product_locale': 'us',
 'esci_label': 'I',
 'small_version': 0,
 'large_version': 1,
 'split': 'train',
 'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
 'product_description': None,
 'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
 'product_brand': 'Panasonic',
 'product_color': 'White'}

The goal is to output something that looks like:

Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan [TITLE] Panasonic [BRAND] White [COLOR] WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air [SEP] Designed to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace [SEP] Detachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation [SEP] This Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan [SEP] 0.35 amp [BULLETPOINT]

Several operations are needed to generate the desired output, following these rules:

  • If a value in the dictionary is None, don't add its content to the output string
  • If a value contains newlines (\n), substitute them with [SEP] tokens
  • Concatenate the strings in the order the user specifies, e.g. the output above follows the order ["product_title", "product_brand", "product_color", "product_bullet_point", "product_description"]

I've tried the following, which more or less works, but the function I've written feels too hardcoded in how it looks up the wanted keys and concatenates and manipulates the strings.


item1 = {'example_id': 0,
 'query': ' revent 80 cfm',
 'query_id': 0,
 'product_id': 'B000MOO21W',
 'product_locale': 'us',
 'esci_label': 'I',
 'small_version': 0,
 'large_version': 1,
 'split': 'train',
 'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
 'product_description': None,
 'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
 'product_brand': 'Panasonic',
 'product_color': 'White'}

item2 = {'example_id': 198,
 'query': '# 2 pencils not sharpened',
 'query_id': 6,
 'product_id': 'B08KXRY4DG',
 'product_locale': 'us',
 'esci_label': 'S',
 'small_version': 1,
 'large_version': 1,
 'split': 'train',
 'product_title': 'AHXML#2 HB Wood Cased Graphite Pencils, Pre-Sharpened with Free Erasers, Smooth write for Exams, School, Office, Drawing and Sketching, Pack of 48',
 'product_description': "<b>AHXML#2 HB Wood Cased Graphite Pencils, Pack of 48</b><br><br>Perfect for Beginners experienced graphic designers and professionals, kids Ideal for art supplies, drawing supplies, sketchbook, sketch pad, shading pencil, artist pencil, school supplies. <br><br><b>Package Includes</b><br>- 48 x Sketching Pencil<br> - 1 x Paper Boxed packaging<br><br>Our high quality, hexagonal shape is super lightweight and textured, producing smooth marks that erase well, and do not break off when you're drawing.<br><br><b>If you have any question or suggestion during using, please feel free to contact us.</b>",
 'product_bullet_point': '#2 HB yellow, wood-cased pencils:Box of 48 count. Made from high quality real poplar wood and 100% genuine graphite pencil core. These No 2 pencils come with 100% Non-Toxic latex free pink top erasers.\nPRE-SHARPENED & EASY SHARPENING: All the 48 count pencils are pre-sharpened, ready to use when get it, saving your time of preparing.\nThese writing instruments are hexagonal in shape to ensure a comfortable grip when writing, scribbling, or doodling.\nThey are widely used in daily writhing, sketching, examination, marking, and more, especially for kids and teen writing in classroom and home.#2 HB wood-cased yellow pencils in bulk are ideal choice for school, office and home to maintain daily pencil consumption.\nCustomer service:If you are not satisfied with our product or have any questions, please feel free to contact us.',
 'product_brand': 'AHXML',
 'product_color': None}


def product2str(row, keys):
    # Map each wanted key to the token appended after its content.
    key2token = {'product_title': '[TITLE]',
                 'product_brand': '[BRAND]',
                 'product_color': '[COLOR]',
                 'product_bullet_point': '[BULLETPOINT]',
                 'product_description': '[DESCRIPTION]'}

    output = ""
    for k in keys:
        content = row[k]
        if content:  # skip None (and empty) values
            output += content.replace('\n', ' [SEP] ') + f" {key2token[k]} "

    return output.strip()

product2str(item2, keys=['product_title', 'product_brand', 'product_color',
                        'product_bullet_point', 'product_description'])

Q: Is there a native Python (standard library) function or recipe for flattening JSON to a str that can achieve similar results to the product2str function?

Q: Or is there already a function/pipeline in the tokenizers library https://pypi.org/project/tokenizers/ that can flatten a JSON/dict into tokens?

WLink
alvas
  • Your string format doesn't look like any standard format, so I don't think you'll find anything already written to do this. – Barmar Apr 21 '23 at 19:48
  • Is there a standard JSON-to-string format that I can refer to? I guess adopting a more standard format is better than rolling my own. – alvas Apr 21 '23 at 19:51
  • Do you mean to ask, "How could this be written using more existing Python stdlib functionality (and less custom hardcoded stuff)?" – Kache Apr 21 '23 at 19:52
  • JSON *is* a string format, why don't you just use that? – Barmar Apr 21 '23 at 19:53
  • Because some libraries in https://github.com/huggingface/tokenizers/tree/main/tokenizers want to read in the data as a `str` type, and having those brackets and colons in the string would mess up the "AI models". I think the model would learn, but in most cases other devs/researchers want that "flattened token string" input rather than a raw JSON string =( – alvas Apr 21 '23 at 20:22
  • There's no "industry standard" way to flatten a dict into a string that's not already some generic serialization format (like JSON), but a natural way is to just concatenate everything together: `' '.join(' '.join(map(str, pair)) for pair in some_dict.items())`. All that remains is mapping the keys and values according to your needs. – Kache Apr 21 '23 at 20:34
  • You could make your code a little more DRY by passing `key2token` to `product2str`. Or perhaps generate `key2token` as `{ k : re.sub(r'product|_', '', k).upper() for k in keys }`. Also, using a `join` on a list comprehension might be cleaner than the loop with concatenation – Nick Apr 22 '23 at 00:50
  • A common pythonic way to stringify part of a dictionary according to a template can be `"[TITLE] {product_title}".format(**item1)`. However, this method has no way to omit the output if the element is None. – ken Apr 26 '23 at 09:18
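
To make the last few comments concrete, here is a minimal sketch of the two generic approaches they mention, run on a made-up two-field row (the `row` dict below is a hypothetical stand-in for the real data). It suggests why neither one replaces a small custom function: the plain join keeps every key and prints None, and the format template cannot drop a missing field.

```python
# Hypothetical, shortened row; the real rows have many more fields.
row = {"product_title": "Fan", "product_color": None}

# Generic join over all items: keys and None values end up in the output.
joined = " ".join(" ".join(map(str, pair)) for pair in row.items())
print(joined)  # product_title Fan product_color None

# Template with str.format: the layout is fixed, so a None value is printed as-is.
templated = "{product_title} [TITLE] {product_color} [COLOR]".format(**row)
print(templated)  # Fan [TITLE] None [COLOR]
```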

2 Answers


It seems clear to me that keys should be a global variable: you will presumably call the function repeatedly with the same keys argument, so it is better to make it a global than to pass it in unnecessarily on every call.

Your tokens follow a clear pattern: you remove the 'product_' prefix, remove the underscores, and convert to UPPERCASE. Why not make a function to do this?

And while you could use a dict comprehension to pre-generate the tokens, I advise against it: there wouldn't be any significant performance gain, and if the comprehension lives inside the function it runs an implicit loop over the keys on every call.

I have shortened your code to this:

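# Requires Python 3.9+ (str.removeprefix; the walrus operator needs 3.8+).
KEYS = ['product_title', 'product_brand', 'product_color', 'product_bullet_point', 'product_description']

def tokenize(key: str) -> str:
    # 'product_bullet_point' -> 'BULLETPOINT'
    return key.removeprefix('product_').replace('_', '').upper()

def product2str(item: dict) -> str:
    # Join "<content> [TOKEN]" for every key in KEYS whose value is present and non-empty.
    return ' '.join(
        '{} [{}]'.format(v.replace('\n', '[SEP]'), tokenize(key))
        for key in KEYS
        if (v := item.get(key))
    )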

I am afraid there isn't anything more that can be done, to the best of my knowledge.

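Using your examples I get the following outputs. Note that [SEP] is inserted without surrounding spaces here; use ' [SEP] ' in the replace call if you need the exact format from the question: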

Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan [TITLE] Panasonic [BRAND] White [COLOR] WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air[SEP]Designed to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace[SEP]Detachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation[SEP]This Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan[SEP]0.35 amp [BULLETPOINT]

AHXML#2 HB Wood Cased Graphite Pencils, Pre-Sharpened with Free Erasers, Smooth write for Exams, School, Office, Drawing and Sketching, Pack of 48 [TITLE] AHXML [BRAND] #2 HB yellow, wood-cased pencils:Box of 48 count. Made from high quality real poplar wood and 100% genuine graphite pencil core. These No 2 pencils come with 100% Non-Toxic latex free pink top erasers.[SEP]PRE-SHARPENED & EASY SHARPENING: All the 48 count pencils are pre-sharpened, ready to use when get it, saving your time of preparing.[SEP]These writing instruments are hexagonal in shape to ensure a comfortable grip when writing, scribbling, or doodling.[SEP]They are widely used in daily writhing, sketching, examination, marking, and more, especially for kids and teen writing in classroom and home.#2 HB wood-cased yellow pencils in bulk are ideal choice for school, office and home to maintain daily pencil consumption.[SEP]Customer service:If you are not satisfied with our product or have any questions, please feel free to contact us. [BULLETPOINT] <b>AHXML#2 HB Wood Cased Graphite Pencils, Pack of 48</b><br><br>Perfect for Beginners experienced graphic designers and professionals, kids Ideal for art supplies, drawing supplies, sketchbook, sketch pad, shading pencil, artist pencil, school supplies. <br><br><b>Package Includes</b><br>- 48 x Sketching Pencil<br> - 1 x Paper Boxed packaging<br><br>Our high quality, hexagonal shape is super lightweight and textured, producing smooth marks that erase well, and do not break off when you're drawing.<br><br><b>If you have any question or suggestion during using, please feel free to contact us.</b> [DESCRIPTION]
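
If the output has to match the format in the question exactly, with spaces around [SEP], a small variation on the function above could look like the sketch below. The `keys` and `sep` parameters and the `DEFAULT_KEYS` name are assumptions added for illustration, not part of the original code, and `str.removeprefix` needs Python 3.9 or later.

```python
DEFAULT_KEYS = ("product_title", "product_brand", "product_color",
                "product_bullet_point", "product_description")

def tokenize(key: str) -> str:
    # 'product_bullet_point' -> '[BULLETPOINT]'
    return "[" + key.removeprefix("product_").replace("_", "").upper() + "]"

def product2str(item: dict, keys=DEFAULT_KEYS, sep: str = " [SEP] ") -> str:
    # Keep only keys with a non-empty value, in the requested order.
    return " ".join(
        "{} {}".format(value.replace("\n", sep), tokenize(key))
        for key in keys
        if (value := item.get(key))
    )

# Hypothetical, shortened row for illustration.
example = {"product_title": "Fan", "product_color": None,
           "product_bullet_point": "quiet\n0.35 amp"}
print(product2str(example))
# Fan [TITLE] quiet [SEP] 0.35 amp [BULLETPOINT]
```

Keeping `keys` as a parameter with a default leaves the call site as short as the global-variable version while still allowing a different field order per call.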
Ξένη Γήινος

I wrote this function to do what you asked for:

def flatten_dict(d, key_order):
    tokens = {
        "product_title": "[TITLE]",
        "product_brand": "[BRAND]",
        "product_color": "[COLOR]",
        "product_description": "[DESCRIPTION]",
        "product_bullet_point": "[BULLETPOINT]",
        # add your other token types here
    }
    parts = []
    for key in key_order:
        if key in d and d[key] is not None:
            parts.append(f"{d[key]} {tokens[key]}")
    return " ".join(parts)

item1 = {
    'example_id': 0,
    'query': ' revent 80 cfm',
    'query_id': 0,
    'product_id': 'B000MOO21W',
    'product_locale': 'us',
    'esci_label': 'I',
    'small_version': 0,
    'large_version': 1,
    'split': 'train',
    'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
    'product_description': None,
    'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
    'product_brand': 'Panasonic',
    'product_color': 'White'
}

keys = ["product_title", "product_brand", "product_color", "product_bullet_point", "product_description"]
output_str = flatten_dict(item1, keys)
print(output_str)

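This does basically what your code does, but instead of building up a string, it collects the parts in a list and joins them at the end. Note that, as written, it does not replace newlines with [SEP], so the bullet points in the output still span several lines.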

OUTPUT:

Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan [TITLE] Panasonic [BRAND] White [COLOR] WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air
Designed to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace
Detachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation
This Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan
0.35 amp [BULLETPOINT]
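
To also cover the newline rule from the question, the same list-and-join approach can be extended as in the sketch below. Passing the token mapping in as an argument and coercing values with `str()` are assumptions made for illustration, and `flatten_dict_sep` is a hypothetical name, not something from the original code.

```python
def flatten_dict_sep(d, key_order, tokens, sep=" [SEP] "):
    parts = []
    for key in key_order:
        value = d.get(key)
        if value is None:
            continue  # rule 1: skip missing or None values
        text = str(value).replace("\n", sep)  # rule 2: newlines become [SEP]
        parts.append(f"{text} {tokens[key]}")
    return " ".join(parts)  # rule 3: order follows key_order

# Hypothetical, shortened mapping and row for illustration.
tokens = {"product_title": "[TITLE]",
          "product_color": "[COLOR]",
          "product_bullet_point": "[BULLETPOINT]"}
row = {"product_title": "Fan",
       "product_color": None,
       "product_bullet_point": "quiet\n0.35 amp"}

flat = flatten_dict_sep(row, ["product_title", "product_color", "product_bullet_point"], tokens)
print(flat)  # Fan [TITLE] quiet [SEP] 0.35 amp [BULLETPOINT]
```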
WLink