Regex to delete emojis from string

Question

I have a list of the Unicode emojis and I want to strip the emojis from it (i.e. I just want the whole first part and the name at the end of each row). Sample rows look like this:

1F468 1F3FD 200D 2695 FE0F   ; fully-qualified # ‍⚕️ man health worker: medium skin tone
1F469 1F3FF 200D 2695        ; non-fully-qualified # ‍⚕ woman health worker: dark skin tone

(from which I have deleted some spaces for the sake of simplicity). What I want is to match the [non-]fully-qualified part as well as the # and the emoji, so I can delete them with sed. I have tried the following regex:

 sed -e 's/\<[on-]*fully-qualified\># *.+?(?=[a-zA-Z]) //g'

which tries to match the words [non-]fully-qualified, a space, the # symbol, and then whatever it can find (non-greedy) until the first letter, and replace all of that with an empty string.

I would like to have this output:

1F468 1F3FD 200D 2695 FE0F   ; man health worker: medium skin tone
1F469 1F3FF 200D 2695        ; woman health worker: dark skin tone

I have tried several posted answers to no avail. Besides, I'm trying to match a pattern between two boundaries, which is where I'm having trouble.

EDIT: I'm trying to run the command in the Git Bash shipped with Git for Windows.

Would this work for you? `sed 's/\(.*;\).*#[^a-zA-Z]*\(.*\)/\1 \2/'` – MauricioRobayo, Aug 20 '17 at 15:31
Your sed script appears to be trying to use a PCRE but no version of sed supports PCREs. Which sed version are you running through - GNU or OSX or something else? — Ed Morton, Aug 20 '17 at 15:31
@archimiro seems to be doing something, but doesn't delete the whole thing in all cases. — mrbolichi, Aug 20 '17 at 15:40

Charles Srstka · Answer 1 · 2017-08-20T15:49:27.057

1

I like to search for what I actually want and then keep it.

This works on OS X in my testing:

sed -E 's/^([^#]+)#[^a-zA-Z\s]*(.*)$/\1 # \2/g'

EDIT: I don't have the Windows version of sed to try, but maybe this will work. Not as precise, but short and simple.

sed -e 's/#\s*[^a-zA-Z\s]*/# /g'

EDIT AGAIN: My bad, I read the question again and you wanted to delete more than just the emoji. This one should do it.

sed -e 's/;[^#]*#\s*[^a-zA-Z\s]*/; /g'
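
As a rough sketch of the same idea, here is a variant that avoids `\s` (a GNU extension) and forces the C locale so that the bracket expression is matched byte by byte. Both of those choices are my assumptions, not part of the command above, and the variable names `emoji`, `row` and `stripped` are only illustrative. The emoji is written as octal escapes so the script itself stays plain ASCII.

```shell
# Sample row from the question; the emoji (U+1F469 U+1F3FF U+200D U+2695)
# is spelled out as UTF-8 octal escapes.
emoji=$(printf '\360\237\221\251\360\237\217\277\342\200\215\342\232\225')
row="1F469 1F3FF 200D 2695        ; non-fully-qualified # $emoji woman health worker: dark skin tone"

# ;           the first semicolon
# [^#]*       everything up to the hash (the qualifier text)
# #           the hash itself
# [^a-zA-Z]*  spaces and emoji bytes, stopping at the first ASCII letter
stripped=$(printf '%s\n' "$row" | LC_ALL=C sed 's/;[^#]*#[^a-zA-Z]*/; /')
printf '%s\n' "$stripped"
```

Because the pattern never mentions the qualifier text, it should treat fully-qualified and non-fully-qualified rows the same way.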

edited Aug 20 '17 at 15:49

answered Aug 20 '17 at 15:33

Still doesn't work. Similar output to that of the @argimiro's command – mrbolichi Aug 20 '17 at 15:45
What's the output, and what's the difference from what you're expecting? What does the second command turn the example input data from your question into on your machine? – Charles Srstka Aug 20 '17 at 15:46
Still the same as the previous one. This is the output: `1F468 1F3FD 200D 2695 FE0F ; ⚕️ man health worker: medium skin tone` I think it strips something because you can see that you don't get the full emoji here – mrbolichi Aug 20 '17 at 15:52
That is a weird result. Maybe some disagreement about encoding is causing the Windows version to have a different idea of what constitutes whitespace. Let's try it searching for literal spaces only: `sed -e 's/;[^#]*# *[^a-zA-Z ]*/; /g'` – Charles Srstka Aug 20 '17 at 16:04
Nope, same output. – mrbolichi Aug 20 '17 at 16:05
Is the source file's encoding in something other than UTF-8? – Charles Srstka Aug 20 '17 at 16:06
No, I opened the file in atom and it says UTF-8 – mrbolichi Aug 20 '17 at 16:07
Your copy of sed seems broken to me :-/ It's as if it's choking when it runs into non-ASCII characters. – Charles Srstka Aug 20 '17 at 16:18
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152392/discussion-between-c4tich-and-charles-srstka). – mrbolichi Aug 20 '17 at 16:19

MauricioRobayo · Accepted Answer · 2017-08-20T16:38:12.093

1

I'm still not entirely sure, but this might work:

sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /'

This matches a semicolon ;, followed by any sequence of characters .*, followed by the "fully-qualified" text, followed by any number of spaces, followed by a hash sign #, followed by any number of characters that are not a-zA-Z [^a-zA-Z]*, and replaces all of that with a semicolon followed by a space.

To make sure that [a-zA-Z] matches only a to z and A to Z and no other characters, which seems to be the problem, a quick fix for just that command is to run it with LC_ALL=C:

LC_ALL=C sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /' file
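
To make the effect concrete, here is a small sketch that runs the command on the two sample rows from the question. It uses a literal space instead of `\s` and reads from a pipe rather than a file. The emoji are written as UTF-8 octal escapes so the script does not depend on the editor's encoding, and the variable names are only illustrative.

```shell
# Emoji from the two sample rows, as UTF-8 bytes:
# U+1F468 U+1F3FD U+200D U+2695 U+FE0F and U+1F469 U+1F3FF U+200D U+2695
man=$(printf '\360\237\221\250\360\237\217\275\342\200\215\342\232\225\357\270\217')
woman=$(printf '\360\237\221\251\360\237\217\277\342\200\215\342\232\225')

row1="1F468 1F3FD 200D 2695 FE0F   ; fully-qualified # $man man health worker: medium skin tone"
row2="1F469 1F3FF 200D 2695        ; non-fully-qualified # $woman woman health worker: dark skin tone"

# With LC_ALL=C sed treats the input as single bytes, so [^a-zA-Z]* consumes
# every emoji byte (and the surrounding spaces) up to the first ASCII letter.
cleaned=$(printf '%s\n' "$row1" "$row2" |
  LC_ALL=C sed 's/;.*fully-qualified *#[^a-zA-Z]*/; /')
printf '%s\n' "$cleaned"
```

Setting the variable on the command line like this affects only that one sed invocation, so the rest of the shell session keeps its usual locale.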

edited Aug 20 '17 at 16:38

answered Aug 20 '17 at 15:57

Nope, the output is the same as posted in my third comment [here](https://stackoverflow.com/a/45783849/5477531) – mrbolichi Aug 20 '17 at 16:06
@c4tich Seems like a windows issue, you can try running sed with `LC_ALL=C sed ...`, for example: `LC_ALL=C sed 's/;.*fully-qualified *#[^a-zA-Z]*/; /'` – MauricioRobayo Aug 20 '17 at 16:24
This did the trick! Care to explain why? also, I can't understand the last semicolon in the regex... (_why_ does [a-zA-Z] match other things besides [a-zA-Z]?) – mrbolichi Aug 20 '17 at 16:32
Looked up LC_ALL, found this: https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do It appears that LC_ALL forces it to treat the input as simple ASCII instead of doing whatever Windows was doing to muck up the encoding and throw the regexes off. Seems like a handy thing to know for the future. Thanks @archimiro – Charles Srstka Aug 20 '17 at 16:35
Updated the answer. English is not my first language and I'm not very fluent, so sorry for any grammar or spelling mistakes. I hope the explanation is useful. – MauricioRobayo Aug 20 '17 at 16:40
