Extract characters from a string by a succession of colons

Question

I am trying to pull some information out of a variable in a data frame. I am using R 3.3.3.

The information is formatted as follows:

t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

I would like to break down each section into a separate variable like so:

w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."

x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."

y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."

z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

I am having some difficulty extracting this information. SO questions such as this and this have been very helpful. From these, I gathered that some form of stringr/gsub can be used to pull this information, but I can't figure out how to specify the ranges within a gsub statement.

I have been able to work out how to pull the first portion:

test4 <- gsub("(.*{1})(:.*)", "\\1", t)

which gives

[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"

My overall question is:

It would be nice if I did not have to clean up the "DOMINICAN REPUBLIC" part from the end of the string.

In summary:

1. How do you extract characters from a string by a succession of colons (1st to 2nd colon, 2nd to 3rd, etc.)?

2. Is there a way to keep the words in front of the colon as well?

Any information or guidance would be greatly appreciated.

acylam · Accepted Answer · 2017-11-22T15:30:36.440

You can use strsplit with an appropriate regex:

strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)

or

stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")

Notes:

\\.\\s matches a literal dot and a space.
(?=[\\w\\s]+:) is a positive lookahead that matches one or more word characters or spaces followed by a colon, without consuming them.
\\.\\s(?=[\\w\\s]+:) thus matches a dot and a space only if they are immediately followed by one or more word characters or spaces and then a colon. This would be the end of each paragraph.
Since I am using the regex within strsplit, I am splitting by whatever is matched by the regex. This results in splitting by the end of each paragraph.
perl=TRUE is needed to enable lookaheads/behinds.
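To make the effect of the lookahead concrete, here is a minimal sketch on a short made-up string (`s`, `parts` and `naive` are illustrative names, not from the question). It compares the split with and without the lookahead:

```r
# Hypothetical short input: section "A" contains two sentences
s <- "A: first. Still first. B C: second. D: third."

# Split on ". " only where word characters/spaces and then a colon follow
parts <- strsplit(s, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]
print(parts)

# Without the lookahead every ". " is a split point,
# so the two sentences of section "A" get separated
naive <- strsplit(s, "\\.\\s", perl = TRUE)[[1]]
print(naive)
```

With the lookahead, the ". " inside section "A" is left alone because the text after it reaches another dot before it reaches a colon.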

Result:

[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
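Note that the split consumes the ". " separator, so every piece except the last loses its trailing period. If the periods should be kept, one possible variation (my assumption, not the regex used above) is to move the dot into a lookbehind so that only the space is consumed. A minimal sketch on a made-up string, which also names each piece by its label:

```r
# Hypothetical short input, not the string from the question
s <- "A: one. B C: two. More two. D: three."

# Split on a space that is preceded by a dot and followed by "label:".
# Only the space is consumed, so each piece keeps its final period.
parts <- strsplit(s, "(?<=\\.)\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

# Name each piece by the text in front of its first colon
names(parts) <- sub(":.*", "", parts)

print(parts)
parts[["B C"]]  # access a single section by its label
```

A named character vector like this is usually easier to work with than separate variables such as w, x, y and z.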

This works great! I am going to have to take some time to fully understand how it is partitioning out the data, but thank you very much! — Jbnimble, Nov 22 '17 at 15:01

Maurits Evers · Answer 2 · 2017-11-21T22:50:10.527

How about the following in base R?

# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";

# Get start positions of the "NAME:" labels (computed once and reused)
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))
matches <- data.frame(
    idx = idx,
    # Distance to the next label; the last section runs to the end of the string
    len = c(diff(idx), nchar(t))
)

# Get substrings based on positions and store them in a character vector
lst <- apply(matches, 1, function(x) {
    # x[1] is the start, sum(x) - 1 is the character just before the next label
    trimws(substr(t, x[1], sum(x) - 1))
})
lst

#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

Note: Regexp-matching countries is a bit awkward because your example contains all caps multi-word countries (DOMINCAN REPUBLIC), all caps single-word countries (e.g. GUAM), and "first-letter-caps" countries (China).
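To see what the pattern picks up for each of these label styles, here is a small sketch on a shortened, made-up string (`s` and `labels` are illustrative names). It uses regmatches to pull out the matched labels themselves rather than their positions:

```r
# Hypothetical shortened input with the three label styles
s <- "China: a. GUAM: b. DOMINCAN REPUBLIC: c."

# Same pattern as above; regmatches() returns the matched text
m <- gregexpr("([A-Z]*\\s*[A-Z]+:|\\w+:)", s)
labels <- trimws(regmatches(s, m)[[1]])
print(labels)
```

The trimws call matters here because the first alternative allows optional whitespace before a single all-caps word, so a match can start at the space in front of the label.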

Thank you! I tried this one out as well and it worked great! — Jbnimble, Nov 22 '17 at 15:03
