I am creating a new column in a pandas DataFrame that should hold the short name of each operating system. I am using a regex and need to match whole words exactly so that they are excluded from the replacement. However, when I change the regex so that it does not select those words, it stops matching them exactly. I have read as many posts about exact-word regex matching here as I could find, and none of the solutions work.

For example, I have data which looks like this:

Android 10kdsh
Chrome OS
Linux ddk2
OS X 10.
Windows 7
iOS c

and I want it to look like this:

Android 
Chrome
Linux
OS X
Windows
iOS

I tried code as follows:

def short_OS(webchat):

    webchat["OS"] = webchat["Operating System"].str.replace(('[^(Android|^OS X|^Chrome|^Linux|^Windows|^iOS)]'),"", regex = True)

    return webchat

However, this leaves some of the extra characters in place, giving:

Androiddsh
ChromeOS
Linuxdd
OS X
Windows
iOS

Obviously these are just examples, but the principle is the same: some characters are left in because they also occur in the words I want to keep.

I should note that framing the words with \b did not change the outcome, and if I use $ for the end of the string, then in the 'Android' example it still leaves '10kdsh' on the same line.

Can anyone help, please?

Thank you

Mizz H
  • This is not quite clear: you want to keep `X` with `OS X`, but your list of "words" includes neither `OS X` nor `X`. What are your real requirements? Also, are you after creating a dynamic pattern from a list of items, or can you simply hardcode them as in the [answer](https://stackoverflow.com/a/65128901/3832970) below? – Wiktor Stribiżew Dec 03 '20 at 15:59
  • Sorry for the confusion. I have edited the code to include OS X; I must have dropped it during all the trial and error. I'm trying to end up with specific descriptors so that the list of operating systems is shorter and can be used in reporting, so rather than Windows 7, Windows 8.1, etc., it says Windows. – Mizz H Dec 03 '20 at 21:59

2 Answers

Instead of replacing, you can match one of the alternatives and extract it into a new column.

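import pandas as pd
data = ["Android 10kdsh", "Chrome OS", "Linux ddk2", "OS X 10.", "Windows 7", "iOS c"]  # sample rows from the question
webchat = pd.DataFrame(data, columns=["Operating System"])
# Capture whichever alternative matches at the start of the string
webchat["OS"] = webchat["Operating System"].str.extract(r"^(Android|Chrome|Linux|OS X|Windows|iOS)\b")
print(webchat)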

Output

  Operating System       OS
0   Android 10kdsh  Android
1        Chrome OS   Chrome
2       Linux ddk2    Linux
3         OS X 10.     OS X
4        Windows 7  Windows
5            iOS c      iOS
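
As a side note on why the replace-based attempt in the question leaves stray letters: square brackets define a character class, so `[^(Android|...)]` does not mean "anything except these words". It means "any single character that is not one of the listed characters". The small sketch below uses Python's `re` module directly (the `samples` and `cleaned` names are only illustrative) with the edited pattern from the question. Because that pattern now contains `OS X`, the space is part of the class and survives as well.

```python
import re

# The pattern from the question: a negated character CLASS, not a negated
# group. The characters ( ) | ^ and the space are literal members of the
# set, together with every letter that occurs in any of the words.
pattern = r"[^(Android|^OS X|^Chrome|^Linux|^Windows|^iOS)]"

samples = ["Android 10kdsh", "Chrome OS", "Linux ddk2"]

# Remove every character that is NOT in the set; letters such as d, s, h
# stay because they appear in "Android", "Windows" and "Chrome".
cleaned = [re.sub(pattern, "", s) for s in samples]
print(cleaned)  # ['Android dsh', 'Chrome OS', 'Linux dd']
```

This is why adding `\b` or `$` inside the brackets does not help: inside a character class they are not treated as word-boundary or end-of-string anchors.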
The fourth bird
  • I hadn't heard of 'extract' before. That definitely seems the way to go, rather than excluding everything except the words I want. However, when I run it with your code above I still get some extra characters in the OS column. – Mizz H Dec 03 '20 at 21:52
  • @MizzH What are the other values that give extra characters? – The fourth bird Dec 03 '20 at 22:10

Using the approach from @The fourth bird I solved this using the following code:

def short_OS(webchat):
 
    webchat["OS"] = webchat["Operating System"].str.extract(r"(\bAndroid\b|\bOS X\b|\bChrome\b|\bLinux\b|\bWindows\b|\biOS\b)")

    return webchat

The \b surrounding each word was needed to match the exact words.
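
A possible variation, offered only as a sketch: the word boundaries can be written once around a single group, and rows that match none of the names can be given a fallback label. The extra `BlackBerry 9` row and the `"Other"` label are assumptions added for illustration and are not part of the original data.

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical row with no known OS.
webchat = pd.DataFrame({
    "Operating System": [
        "Android 10kdsh", "Chrome OS", "Linux ddk2",
        "OS X 10.", "Windows 7", "iOS c", "BlackBerry 9",
    ]
})

# One capture group with a word boundary on each side of the alternation.
pattern = r"\b(Android|OS X|Chrome|Linux|Windows|iOS)\b"

# expand=False returns a Series; rows without a match come back as NaN,
# which is replaced here by the assumed fallback label "Other".
webchat["OS"] = (
    webchat["Operating System"].str.extract(pattern, expand=False).fillna("Other")
)

short_names = webchat["OS"].tolist()
print(short_names)
# ['Android', 'Chrome', 'Linux', 'OS X', 'Windows', 'iOS', 'Other']
```

Without the `fillna` step the unmatched row would simply hold NaN, which may be preferable if missing values should stay visible in reporting.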

Mizz H