keywords = ['Small', 'Medium', 'Large']
df.column
0   The Small, Large, Medium
1   The fast Medium, Small XS
2   He was a Medium, Large or Small

How could I tell pandas that, if a row contains any of the keywords, it should:

  1. Reorder the keywords so that they appear in the same order as in the list
  2. If a keyword is followed by the suffix "XS", keep that suffix attached to the keyword when reordering

Expected Output:

0 The Small, Medium, Large 
1 The fast Small XS, Medium
2 He was a Small, Medium or Large
asd
  • Is `XS` the only valid suffix? If not, how do you know that `or` in row 2 is not a suffix for `Medium`? – Nick Feb 27 '21 at 22:35
  • Yep, `XS` is the only valid suffix – asd Feb 27 '21 at 22:35
  • Can there be duplicates of each keyword or only one of each e.g. are `Medium, Medium` or `Small, Small XS` valid? – Nick Feb 27 '21 at 22:47
  • There won't be duplicates in my case, but I guess the more general solution the better, if possible – asd Feb 27 '21 at 22:56
  • Is there an order between Small and Small XS, or do you want to preserve the order in which they appear? – Bing Wang Feb 27 '21 at 23:01

1 Answer


One way to do this is to:

  1. Use re.findall to split the string into tokens: keyword matches (with or without the XS suffix) and everything else
  2. Sort the keyword tokens according to their index in the keywords list
  3. Rebuild the token list, putting the sorted keywords back into the positions the keywords occupied
  4. Join the tokens back together into a string

You can do that with this function:

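import re

def sizesorter(s, keywords):
    # Tokenise: a keyword (optionally followed by " XS"), or any other chunk up to the next whitespace or end of string
    words = re.findall(r'((?:\b(?:' + '|'.join(keywords) + r')\b)(?:\sXS)?|(?:[^\s]*(?:\s|$)))', s, re.I)
    # Sort only the keyword tokens by list position (this membership test is case-sensitive, unlike the re.I match)
    sizes = iter(sorted([w for w in words if w.split(' ')[0] in keywords], key=lambda w:keywords.index(w.split(' ')[0])))
    # Put the sorted keywords back into the slots that held keyword tokens; leave everything else in place
    words = [w if w.split(' ')[0] not in keywords else next(sizes) for w in words]
    return ''.join(words)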
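
To make step 1 concrete, here is a small sketch of what the pattern splits a row into. It reuses the same regular expression as `sizesorter`; the variable names `pattern` and `tokens` are only for illustration. Empty strings are filtered out before printing, because the pattern can also match the empty string at the very end of the input.

```python
import re

keywords = ['Small', 'Medium', 'Large']

# Same pattern as in sizesorter: a keyword with an optional " XS" suffix,
# or any other run of non-whitespace followed by whitespace or end of string.
pattern = r'((?:\b(?:' + '|'.join(keywords) + r')\b)(?:\sXS)?|(?:[^\s]*(?:\s|$)))'

s = 'The fast Medium, Small XS'
tokens = re.findall(pattern, s, re.I)

# Drop empty tokens for display only; they do not affect the joined result.
print([t for t in tokens if t])
# ['The ', 'fast ', 'Medium', ', ', 'Small XS']

# Nothing is lost by tokenising: joining the tokens gives back the input.
print(''.join(tokens) == s)  # True
```

Because every character ends up in exactly one token, joining the tokens after reordering the keyword tokens leaves the rest of the sentence untouched.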

You can then apply that function to the column. For example:

import pandas as pd
import re

df = pd.DataFrame({'column': ['The Small, Large, Medium',
                              'The fast Medium, Small XS',
                              'He was a Medium, Large or Small',
                              'small, Large a metre']})

def sizesorter(s, keywords):
    words = re.findall(r'((?:\b(?:' + '|'.join(keywords) + r')\b)(?:\sXS)?|(?:[^\s]*(?:\s|$)))', s, re.I)
    sizes = iter(sorted([w for w in words if w.split(' ')[0] in keywords], key=lambda w:keywords.index(w.split(' ')[0])))
    words = [w if w.split(' ')[0] not in keywords else next(sizes) for w in words]
    return ''.join(words)
    
df.column = df.column.apply(sizesorter, args=(['Small', 'Medium', 'Large'], ))

print(df)

Output:

                            column
0         The Small, Medium, Large
1        The fast Small XS, Medium
2  He was a Small, Medium or Large
3             small, Large a metre

Partial sorting of the list of words adapted from this answer.
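
The partial-sorting idea can also be shown on its own. This is a minimal sketch rather than the code from the referenced answer: `partial_sort` and its parameters are hypothetical names chosen for illustration. It sorts only the items that satisfy a predicate and leaves every other item in its original position.

```python
def partial_sort(items, is_sortable, key):
    """Sort only the items for which is_sortable(item) is true; keep the rest in place."""
    # Sorted sequence of just the sortable items, consumed one at a time below.
    sorted_items = iter(sorted((x for x in items if is_sortable(x)), key=key))
    # Each sortable slot receives the next sorted item; other slots keep their item.
    return [next(sorted_items) if is_sortable(x) else x for x in items]

order = ['Small', 'Medium', 'Large']
tokens = ['He ', 'was ', 'a ', 'Medium', ', ', 'Large', ' ', 'or ', 'Small']

result = partial_sort(tokens, lambda t: t in order, order.index)
print(''.join(result))  # He was a Small, Medium or Large
```

`sizesorter` does the same thing inline, with the extra step of taking the first word of each token so that `Small XS` sorts as `Small`.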

Nick
  • Many thanks! I didn't want to overcomplicate the question. If a row also has a prefix, "Size", would it be easy to amend the code so that Size is always at the front, i.e. change `The Large, Size Medium, Small` to `The Size Small, Medium, Large`? – asd Feb 28 '21 at 09:01
  • @asd I'm glad this worked for you. It's late here - I'll sleep on the problem but you might get a quicker response with a new question, you can refer back to this one for context. Off the top of my head I thought adding `Size` to the keyword list might work but it's not quite right https://ideone.com/wUGJFR – Nick Feb 28 '21 at 10:55
  • Thanks, no problem if not! Would you also know why the end of a sentence is sometimes chopped off, both in rows that contain a keyword and in rows that don't? For example, "metre" in "small, large a metre" is being deleted. – asd Feb 28 '21 at 14:37
  • @asd quick fix for the word deletion issue, I didn't allow for a non-size word at the end of the string, change the pattern to `r'((?:' + '|'.join(keywords) + ')(?:\sXS)?|(?:[^\s]*(?:\s|$)))'`. Still thinking about how best to resolve the `Size` issue. Got some ideas but real work is in the way... :P – Nick Feb 28 '21 at 22:40
  • @asd haven't had a lot of time to look at this, but here's something that does work, if not the most efficient: https://ideone.com/fUy1a7 – Nick Mar 01 '21 at 06:07
  • Many thanks, unfortunately I'm getting list index out of range for the sizes[0] line – asd Mar 01 '21 at 07:59
  • @asd all the code is case sensitive, that is likely causing that problem. You will need to convert case before comparison (perhaps make keywords all lowercase and translate each word as well before comparison) – Nick Mar 01 '21 at 08:19
  • @asd I think this does what you want (and is case insensitive). It's a bit ugly but I unfortunately don't have the time to look at it any more... https://ideone.com/fbvaIq – Nick Mar 01 '21 at 11:36
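
For readers following the case-sensitivity discussion in the comments, here is one possible sketch of a case-insensitive variant. It is not the code behind the ideone links, which is not reproduced on this page. The name `sizesorter_ci` and the helper `rank` are hypothetical, and the sketch assumes each matched word should keep its original casing. It does not handle the `Size` prefix.

```python
import re

def sizesorter_ci(s, keywords):
    # Hypothetical case-insensitive variant of sizesorter (an illustration only).
    lowered = [k.lower() for k in keywords]
    alternation = '|'.join(re.escape(k) for k in keywords)
    pattern = r'((?:\b(?:' + alternation + r')\b)(?:\sXS)?|(?:[^\s]*(?:\s|$)))'
    words = re.findall(pattern, s, re.I)

    def rank(w):
        # Index of the token's first word in the keyword list, or None for non-keywords.
        parts = w.split()
        head = parts[0].lower() if parts else ''
        return lowered.index(head) if head in lowered else None

    # Sort only the keyword tokens, then put them back into the keyword slots.
    sizes = iter(sorted((w for w in words if rank(w) is not None), key=rank))
    return ''.join(next(sizes) if rank(w) is not None else w for w in words)

print(sizesorter_ci('large, small a metre', ['Small', 'Medium', 'Large']))
# small, large a metre
```

Lower-casing both sides of the comparison is the approach suggested in the comment above; the regular expression already matches case-insensitively through `re.I`, so only the membership test and the sort key need to change.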