Remove all text before colon

Question

I have a file containing a certain number of lines. Each line looks like this:

TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1

I would like to remove everything before the ":" character in order to retain only PKMYT1, which is a gene name. Since I'm not an expert in regex scripting, can anyone help me do this using Unix (sed or awk) or in R?

score 114 · Accepted Answer · answered Sep 06 '12 at 10:23

Here are two ways of doing it in R:

foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"

# Remove all before and up to ":":
gsub(".*:","",foo)

# Extract everything after ":" (returns a list; use [[1]] to get the character vector):
regmatches(foo,gregexpr("(?<=:).*",foo,perl=TRUE))

Sacha Epskamp


Also, if any of the gene names might themselves contain a `:`, you could match and replace up to the *first* `:` using `gsub("^[^:]*:", "", foo)` – Josh O'Brien Sep 06 '12 at 15:49
`gsub()` allows you to use "regular expressions". In the answer above, the `.` is a wildcard (any character), the `*` means "zero or more occurrences", and the `:` is the symbol we want to stop at. If, say, you wanted to remove everything before a `-`, you could replace the colon with one. The argument after `".*:",` is the replacement for whatever was matched (everything up to and including the `:`), so be sure to avoid adding an erroneous space between the quotes there. – Pake Jan 10 '23 at 22:12

score 25 · Answer 2 · answered Sep 06 '12 at 10:22

A simple regular expression used with gsub():

x <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
gsub(".*:", "", x)
[1] "PKMYT1"

See ?regex or ?gsub for more help.

Andrie


John Carter · Answer 3 · 2012-09-06T19:56:51.903

Using sed:

sed 's/.*://' < your_input_file > output_file

This will replace anything followed by a colon with nothing, so it'll remove everything up to and including the last colon on each line (because * is greedy by default).

As per Josh O'Brien's comment, if you wanted to only replace up to and including the first colon, do this:

sed "s/[^:]*://"

That will match anything that isn't a colon, followed by one colon, and replace with nothing.

Note that both of these patterns stop after the first match on each line. If you want the replacement to happen for every match on a line, add the 'g' (global) flag to the end of the substitution, e.g. s/[^:]*://g.
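To make the difference between these patterns concrete, here is a minimal sketch run against a line with two colons. The sample line and the function names are hypothetical, chosen only for illustration; they are not from the original question.

```shell
# Hypothetical sample line containing two colons.
line='dir/file.adj:GENE:variant'

# Greedy .* : removes everything up to and including the LAST colon.
strip_to_last_colon() { sed 's/.*://'; }

# Negated class: removes everything up to and including the FIRST colon.
strip_to_first_colon() { sed 's/[^:]*://'; }

# Same negated-class pattern with the g flag: every "text:" run is removed.
strip_every_prefix() { sed 's/[^:]*://g'; }

printf '%s\n' "$line" | strip_to_last_colon    # prints: variant
printf '%s\n' "$line" | strip_to_first_colon   # prints: GENE:variant
printf '%s\n' "$line" | strip_every_prefix     # prints: variant
```

On lines with a single colon, such as the one in the question, all three give the same result.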

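Also note that with GNU sed (the default on Linux) you can edit a file in place with -i. The sed shipped with OS X also has -i, but it requires a backup-suffix argument, which may be empty: sed -i '' 's/.*://' your_file. With GNU sed, e.g.: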

sed -i 's/.*://' your_file
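If the script has to run on both systems, one option is to avoid -i altogether and write through a temporary file. The following is only a sketch: the function name and the sample data are hypothetical, and it assumes mktemp is available (it is on common Linux and OS X installations, although it is not part of POSIX).

```shell
# Rewrite FILE so that each line keeps only the text after its last colon.
# Going through a temporary file avoids relying on any particular sed -i syntax.
strip_prefix_in_place() {
    tmp=$(mktemp) || return 1
    # Copy the result back with cat so the original file keeps its permissions.
    sed 's/.*://' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Demo on a throwaway file (hypothetical sample data).
demo=$(mktemp)
printf '%s\n' 'a/b.adj:PKMYT1' 'c/d.adj:GENE2' > "$demo"
strip_prefix_in_place "$demo"
cat "$demo"
rm -f "$demo"
```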

score 10 · Answer 4 · answered Sep 06 '12 at 11:59

There are certainly more than 2 ways in R. Here's another.

unlist(lapply(strsplit(foo, ':', fixed = TRUE), '[', 2))

If the string has a constant length I imagine substr would be faster than this or regex methods.

John


I suspect this may be the fastest R solution given. +1 – Tyler Rinker Sep 06 '12 at 12:28

score 8 · Answer 5 · answered Aug 13 '21 at 07:49

Solution using str_remove from the stringr package:

library(stringr)
str_remove("TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1", ".*:")
[1] "PKMYT1"

ToWii


score 6 · Answer 6 · answered Sep 06 '12 at 10:31

You can use awk like this:

awk -F: '{print $2}' /your/file
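Note that $2 is only the second field, so on a line with more than one colon it drops whatever follows the second colon. As a hedged sketch (the sample line and function names are hypothetical), these awk variants keep either the text after the last colon or everything after the first one:

```shell
# Hypothetical sample line containing two colons.
line='dir/file.adj:GENE:variant'

# $NF is the last colon-separated field, i.e. the text after the last colon.
awk_after_last_colon() { awk -F: '{ print $NF }'; }

# sub() deletes the leading "text:" once, keeping everything after the first colon.
awk_after_first_colon() { awk '{ sub(/^[^:]*:/, ""); print }'; }

printf '%s\n' "$line" | awk -F: '{ print $2 }'    # prints: GENE
printf '%s\n' "$line" | awk_after_last_colon      # prints: variant
printf '%s\n' "$line" | awk_after_first_colon     # prints: GENE:variant
```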

Costi Ciudatu


score 5 · Answer 7 · answered Jan 04 '18 at 17:45

A simple variation on the best answer (by @Sacha Epskamp) that I initially missed is to use the same function to keep everything before the ":" instead of removing it:

foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"

# 1st, as in that answer, to remove everything up to and including ":":
gsub(".*:","",foo)

# 2nd, to keep only what comes before the first ":" (the colon and everything after it are removed):
gsub(":.*","",foo)

Basically it is the same thing; just move the ":" to the other side of ".*" in the pattern. I hope it helps.

score 2 · Answer 8 · answered Sep 06 '12 at 12:49

If you have GNU coreutils available use cut:

cut -d: -f2 infile
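With -f2, cut prints only the second field. If the part after the colon may itself contain colons, the field range -f2- keeps everything from the second field onward. A small sketch with a hypothetical sample line and function names:

```shell
# Hypothetical sample line containing two colons.
line='dir/file.adj:GENE:variant'

# Only the second colon-separated field.
cut_second_field() { cut -d: -f2; }

# Second field through the end of the line, i.e. everything after the first colon.
cut_after_first_colon() { cut -d: -f2-; }

printf '%s\n' "$line" | cut_second_field        # prints: GENE
printf '%s\n' "$line" | cut_after_first_colon   # prints: GENE:variant
```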

Thor


score 2 · Answer 9 · answered Nov 30 '17 at 23:32

I was working on a similar issue. John's and Josh O'Brien's advice did the trick. I started with this tibble:

library(dplyr)
my_tibble <- tibble(Col1=c("ABC:Content","BCDE:MoreContent","FG:Content:with:colons"))

It looks like:

  | Col1 
1 | ABC:Content 
2 | BCDE:MoreContent 
3 | FG:Content:with:colons

I needed to create this tibble:

  | Col1                   | Col2 | Col3
1 | ABC:Content            | ABC  | Content
2 | BCDE:MoreContent       | BCDE | MoreContent
3 | FG:Content:with:colons | FG   | Content:with:colons

And did so with this code (R version 3.4.2).

my_tibble2 <- mutate(my_tibble
        ,Col2 = unlist(lapply(strsplit(Col1, ':',fixed = TRUE), '[', 1))
        ,Col3 = gsub("^[^:]*:", "", Col1))

score 0 · Answer 10 · answered Oct 09 '15 at 17:59

Below are 2 solutions that give the same result when each line contains exactly one colon:

The first uses perl's -a autosplit feature to split each line into fields on : (set with -F:), populate the @F fields array, and print the 2nd field, $F[1] (fields are counted starting from 0).

perl -F: -lane 'print $F[1]' file

The second uses a substitution, s///, to replace the match with nothing: ^ anchors the match at the beginning of the line, and .*: matches any characters up to and including the last colon.

perl -pe 's/^.*://' file
