I am trying to pull some information out of a variable in a data frame. I am using R 3.3.3.
The information is formatted as follows:
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I would like to break down each section into a separate variable like so:
w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I am having some difficulty extracting this information. SO questions such as this and this have been very helpful. From these, I gathered that stringr or gsub can be used to pull out this information, but I can't figure out how to specify the ranges within a gsub statement.
I have been able to work out how to pull the first portion:
>test4 <- gsub("(.*{1})(:.*)","\\1", t)
which gives
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"
This returns everything up to the last colon instead of only the first section.
It would be nice if I did not have to clean up the "DOMINCAN REPUBLIC" part from the end of the string.
In summary:
1. How do you extract the characters between successive colons in a string (1st to 2nd colon, 2nd to 3rd, etc.)?
2. Is there a way to keep the words in front of each colon as well?
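To make the goal concrete, here is a rough sketch of the kind of result I am after. It assumes that every section starts with a label made up only of letters and spaces followed by a colon, and that the previous section ends with a period. The sample string txt and the names parts and labels are just placeholders for illustration, and this may well not be the best way to do it:

```r
# Shortened placeholder sample in the same "LABEL: text." layout as t
txt <- paste(
  "China: Officially the People's Republic of China (PRC), with a population of around 1.404 billion.",
  "GUAM: Is an unincorporated territory of the United States.",
  "MICRONESIA: Is a subregion of Oceania. It has a shared cultural history with two other island regions, Polynesia and Melanesia.",
  "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola."
)

# Split on whitespace that follows a period and precedes "letters/spaces + colon".
# The lookbehind and lookahead are zero-width, so the period and the label are kept.
parts <- strsplit(txt, "(?<=\\.)\\s+(?=[A-Za-z ]+:)", perl = TRUE)[[1]]

# Pull out the label in front of the first colon of each piece
labels <- sub(":.*", "", parts)

print(parts)
print(labels)
```

On this sample, the period before "It has a shared cultural history" does not trigger a split, because the text after it reaches a comma before any colon. Labels containing digits or punctuation would need a wider character class.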
Any information or guidance would be greatly appreciated.