
I am trying to pull some information out of a variable in a data frame. I am using R 3.3.3.

The information is formatted as follows:

t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

I would like to break down each section into a separate variable like so:

w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."

x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."

y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."

z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

I am having some difficulty extracting this information. SO questions such as this and this have been very helpful. From these, I gathered that some form of stringr or gsub can be used to pull this information, but I can't figure out how to specify the ranges within a gsub statement.

I have been able to work out how to pull the first portion:

>test4 <- gsub("(.*{1})(:.*)","\\1", t)

which gives

[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"

It would be nice if I did not have to clean up the "DOMINICAN REPUBLIC" part from the end of the string.

In summary:

1. How do you extract the characters between successive colons in a string? (1st to 2nd colon, 2nd to 3rd, etc.)

2. Is there a way to keep the words in front of each colon as well?

Any information or guidance would be greatly appreciated.

acylam
Jbnimble

2 Answers


You can use strsplit with an appropriate regex:

strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)

or

stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")

Notes:

  1. \\.\\s matches a literal dot and a space.
  2. (?=[\\w\\s]+:) is a positive lookahead that matches one or more word characters or spaces followed by a colon.
  3. \\.\\s(?=[\\w\\s]+:) thus matches a dot and a space only if they are immediately followed by one or more word characters or spaces and then a colon. This is the end of each section.
  4. Because the regex is used within strsplit, the string is split at whatever the regex matches, that is, at the end of each section. The matched dot and space are dropped, which is why only the last element keeps its final period.
  5. perl=TRUE is needed in strsplit to enable lookaheads/lookbehinds; stringr::str_split supports them without it.

Result:

[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region." 
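
If the trailing periods should be kept, as in the desired output in the question, one possible variation (a sketch, not part of the original answer) moves the dot into a lookbehind so that only the space is consumed by the split. The shortened sample string and the names `parts`, `w`, `x`, `y`, `z` are assumptions made for illustration.

```r
# Shortened sample string with the same "NAME: text." layout (illustrative only)
t <- "China: Has a population of around 1.404 billion. GUAM: Is a territory. MICRONESIA: Is a subregion. It has many islands. DOMINCAN REPUBLIC: Is a country."

# Split on the single space that follows a dot and precedes a "NAME:" header.
# The dot sits inside a lookbehind, so it is not removed by the split.
parts <- strsplit(t, "(?<=\\.)\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

# Unpack into separate variables, as asked in the question
w <- parts[1]
x <- parts[2]
y <- parts[3]
z <- parts[4]

print(parts)
```

Keeping the pieces in the vector `parts` is usually more convenient than four separate variables, since the number of sections may vary.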
acylam

How about the following in base R?

# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";

# Get start positions of the "NAME:" headers (computed once and reused)
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))

# Each section runs from its header to the next header (or to the end of the string)
matches <- data.frame(
    idx = idx,
    len = c(diff(idx), nchar(t))
)

# Get substrings based on positions and store them in a character vector
lst <- apply(matches, 1, function(x) {
    trimws(substr(t, x[1], sum(x) - 1))
})
lst

#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

Note: Regexp-matching countries is a bit awkward because your example contains all caps multi-word countries (DOMINCAN REPUBLIC), all caps single-word countries (e.g. GUAM), and "first-letter-caps" countries (China).
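
Once the sections have been separated, the mixed capitalisation matters less, because the name is simply whatever precedes the first colon in each piece. The following is a minimal sketch under that assumption; the shortened `parts` vector and the names `header`, `body`, and `sections` are illustrative and not from the original answer.

```r
# Pieces as produced by a split on section boundaries (shortened, illustrative)
parts <- c(
  "China: Has a population of around 1.404 billion.",
  "GUAM: Is a territory.",
  "DOMINCAN REPUBLIC: Is a country."
)

# Header: everything before the first colon, regardless of capitalisation
header <- sub(":.*$", "", parts)

# Body: everything after the first colon and any whitespace that follows it
body <- sub("^[^:]*:\\s*", "", parts)

# Named character vector: names are the headers, values are the descriptions
sections <- setNames(body, header)
print(sections[["GUAM"]])
```

This keeps the words in front of the colon available as names while the descriptions can still be looked up individually.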

Maurits Evers