
How can I separate one column into two columns in R? I have the following data:

x
2. Nepsalus jezoensis
species, insects
3. Prochas sp. 2 YYH-2022a
species, wasps, ants & bees
4. Prochas sp. 1 YYH-2022a
species, wasps, ants & bees
5. Eccoptopterus sp. 1 CP-2022
species, beetles
6. Andricus sp. 1 CYS-2022a
species, wasps, ants & bees
7. Paralabellula curvicauda
species, earwigs
8. Paralabellula
genus, earwigs
9. Pristiphora sp.
species, hymenopterans
10. Phyllotreta flexuosa
species, beetles

But I need two columns x and y:

[image: table]

Many thanks for your help!

All the best!

Darren Tsai
  • Paste the output of `dput(this column)` into your question. – user2974951 Jul 06 '22 at 05:39
  • Hi! Welcome to SO. In which format did you load the dataset? .csv, .txt, ... – R18 Jul 06 '22 at 05:48
  • `data.frame(scan(text=x, multi.line=TRUE, what=list(x="",y=""), sep="\n"))` - if that works for you, this is essentially a duplicate of https://stackoverflow.com/questions/68371066/is-it-possible-to-convert-lines-from-a-text-file-into-columns-to-get-a-dataframe/68371645 – thelatemail Jul 06 '22 at 05:52
  • Hi all, thanks for your replies. The format is .txt. I tried running `data.frame(scan(text=x, multi.line=TRUE, what=list(x="",y=""), sep="\n"))`, but R hit a fatal error and restarted. I think it happened because I have many rows, almost 1,600,000. Do you have any idea how to resolve this problem? – Pedro Alexander Velasquez Vasc Jul 06 '22 at 15:52
  • @PedroAlexanderVelasquezVasc - unless you have a very limited machine, that many rows should not be an issue. `bigx <- paste(rep(x,1e5), collapse="\n"); system.time({out <- data.frame(scan(text=bigx, multi.line=TRUE, what=list(x="",y=""), sep="\n"))})` completed in less than a second here for 1,800,000 rows. – thelatemail Jul 06 '22 at 19:38
  • Where `x` was: `x <- "2. Nepsalus jezoensis\nspecies, insects\n3. Prochas sp. 2 YYH-2022a\nspecies, wasps, ants & bees\n4. Prochas sp. 1 YYH-2022a\nspecies, wasps, ants & bees\n5. Eccoptopterus sp. 1 CP-2022\nspecies, beetles\n6. Andricus sp. 1 CYS-2022a\nspecies, wasps, ants & bees\n7. Paralabellula curvicauda\nspecies, earwigs\n8. Paralabellula\ngenus, earwigs\n9. Pristiphora sp.\nspecies, hymenopterans\n10. Phyllotreta flexuosa\nspecies, beetles"` – thelatemail Jul 06 '22 at 19:39
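
As a minimal sketch of the `scan()` approach suggested in the comments above, the call can be tried on a shortened copy of the sample string. The three-record `x` and the names `fields` and `out` are illustrative assumptions; only the `scan()` arguments come from the comment.

```r
# Shortened copy of the sample string from the comments (three records only)
x <- "2. Nepsalus jezoensis\nspecies, insects\n3. Prochas sp. 2 YYH-2022a\nspecies, wasps, ants & bees\n8. Paralabellula\ngenus, earwigs"

# Each record spans two lines: `what` declares two character fields,
# sep = "\n" treats every line as one field, and multi.line = TRUE
# lets a record continue onto the next line.
fields <- scan(text = x, what = list(x = "", y = ""),
               sep = "\n", multi.line = TRUE, quiet = TRUE)

# scan() returns a named list of two character vectors; wrap it as a data frame
out <- data.frame(fields, stringsAsFactors = FALSE)
out
```

For a file on disk, the same call should accept `file = ` with a path in place of `text = x`, assuming every record in the file really does occupy exactly two lines.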

1 Answer

Try this:

library(tidyr)
data.frame(x) %>%
  # split the string into one row per record, at each newline that is followed by digits and a period:
  separate_rows(x, sep = "\\n(?=\\d+\\.)") %>%
  # extract the relevant parts into two columns using `species` or `genus` as demarcation:
  extract(x,
          into = c("part1", "part2"),
          regex = "(.*)\\n(species.*|genus.*)")
# A tibble: 9 × 2
  part1                          part2                      
  <chr>                          <chr>                      
1 2. Nepsalus jezoensis          species, insects           
2 3. Prochas sp. 2 YYH-2022a     species, wasps, ants & bees
3 4. Prochas sp. 1 YYH-2022a     species, wasps, ants & bees
4 5. Eccoptopterus sp. 1 CP-2022 species, beetles           
5 6. Andricus sp. 1 CYS-2022a    species, wasps, ants & bees
6 7. Paralabellula curvicauda    species, earwigs           
7 8. Paralabellula               genus, earwigs             
8 9. Pristiphora sp.             species, hymenopterans     
9 10. Phyllotreta flexuosa       species, beetles

Data:

x <- "2. Nepsalus jezoensis
species, insects
3. Prochas sp. 2 YYH-2022a
species, wasps, ants & bees
4. Prochas sp. 1 YYH-2022a
species, wasps, ants & bees
5. Eccoptopterus sp. 1 CP-2022
species, beetles
6. Andricus sp. 1 CYS-2022a
species, wasps, ants & bees
7. Paralabellula curvicauda
species, earwigs
8. Paralabellula
genus, earwigs
9. Pristiphora sp.
species, hymenopterans
10. Phyllotreta flexuosa
species, beetles"
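
For comparison, the same two steps can be sketched in base R without tidyr. This is an illustrative alternative rather than part of the answer's code: the shortened `x` and the names `records`, `parts`, and `out` are assumptions made for the example.

```r
# Shortened copy of the sample data (three records only)
x <- "2. Nepsalus jezoensis\nspecies, insects\n8. Paralabellula\ngenus, earwigs\n10. Phyllotreta flexuosa\nspecies, beetles"

# Step 1: split at every newline that is followed by digits and a period,
# giving one "name\nrank" string per record (the lookahead needs perl = TRUE)
records <- strsplit(x, "\\n(?=\\d+\\.)", perl = TRUE)[[1]]

# Step 2: split each record at its remaining newline and bind the halves
parts <- do.call(rbind, strsplit(records, "\n", fixed = TRUE))
out <- data.frame(part1 = parts[, 1], part2 = parts[, 2],
                  stringsAsFactors = FALSE)
out
```

Unlike the `extract()` regex, this version does not check that the second line starts with `species` or `genus`; it only assumes each record has exactly two lines.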
Chris Ruehlemann
  • Thanks for your reply, Chris, but it doesn't work yet. With your code, the following error appears: `Error in `as_indices_impl()`: ! Must subset columns with a valid subscript vector. x Subscript has the wrong type `tbl_df< part1: character part2: character >`. i It must be numeric or character.` I tried converting the variable with `as.character`, but it didn't work: `# A tibble: 2 x 2 part1 part2 1 NA NA 2 NA NA ` – Pedro Alexander Velasquez Vasc Jul 06 '22 at 15:01
  • what's the error? – Chris Ruehlemann Jul 06 '22 at 15:02
  • The error is the following: `Error in as_indices_impl(): ! Must subset columns with a valid subscript vector. x Subscript has the wrong type tbl_df< part1: character part2: character >. i It must be numeric or character.` – Pedro Alexander Velasquez Vasc Jul 06 '22 at 15:45
  • Well, the code works with the data I've used and posted. I assume that the error is because the structure of your actual data frame is different than the vector `x`. Can you post the output of `dput(head(YOURDATA))` in the question so I can take a look? – Chris Ruehlemann Jul 06 '22 at 15:52
  • Of course my friend. I have almost 1600000 rows: `structure(list(X1..Neodiprion.virginianus = c(" species, hymenopterans", "2. Nepsalus jezoensis", " species, insects", "3. Prochas sp. 2 YYH-2022a", " species, wasps, ants & bees", "4. Prochas sp. 1 YYH-2022a" )), row.names = c(NA, 6L), class = "data.frame")` – Pedro Alexander Velasquez Vasc Jul 06 '22 at 15:58
  • well, with all respect, that could hardly be any *more* different than what you posted earlier! I'm afraid I won't be able to help you with that. It appears that there have been some serious issues upon reading-in the data. Maybe you should try a different read-in option – Chris Ruehlemann Jul 06 '22 at 16:25
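
The `dput()` output above suggests that the file was read with its first line taken as the column header, and that every second line carries a leading space. One possible read-in, sketched under those assumptions, is to take the raw lines and pair them up by position. The vector `txt` is a hypothetical stand-in for the file contents, its first entry is inferred from the mangled column name `X1..Neodiprion.virginianus`, and the path `yourfile.txt` is a placeholder.

```r
# Hypothetical stand-in for the .txt file: two lines per record,
# with a leading space on the second line as in the dput() output.
txt <- c("1. Neodiprion virginianus",
         " species, hymenopterans",
         "2. Nepsalus jezoensis",
         " species, insects",
         "3. Prochas sp. 2 YYH-2022a",
         " species, wasps, ants & bees")

# With a real file this would be something like: txt <- readLines("yourfile.txt")
lines <- trimws(txt)                 # strip the leading spaces
lines <- lines[nzchar(lines)]        # drop blank lines, if any
stopifnot(length(lines) %% 2 == 0)   # expect complete name/rank pairs

# Odd-numbered lines are names, even-numbered lines are rank and group
out <- data.frame(x = lines[c(TRUE, FALSE)],
                  y = lines[c(FALSE, TRUE)],
                  stringsAsFactors = FALSE)
out
```

Because `readLines()` does not treat any line as a header, the first record stays in the data, and pairing by position needs no regular expression. It does rely on every record having exactly two lines.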