0

This question is quite difficult to describe, but easy to understand when visualized. I would therefore suggest looking at the two images that I linked to this post to help facilitate understanding the issue.

Here is a link to my practice data frame:

sample.data <-read.table("https://pastebin.com/uAQD6nnM", header=T, sep="\t")

I don't know why I get an error "more columns than column names", because using the same file from my desktop works just fine, however clicking on the link goes to my dataset.

I received very large data frames that are arranged in rows, and I want it to be put it in columns, however it is not that 'easy', because I do not necessarily want (or need) to transpose all the data.

This link appears to be close to what I would like to do, but just not quite the right answer for me Python Pandas: Transpose or Stack?

I have a header with GPS data (Coords_Y, Coords_X), followed by a list of 100+ plant species names. If a species is present at a certain location, the author used the term TRUE, and if not present, they used the term FALSE.

I would like to take this data set I've been sent, create a new column called "species", where it stacks each of the species listed in rows on top of each other , & keeps only data set to TRUE. Therefore, as my images point out, if 2 plants are both present at the same location, then the GPS points will need to be duplicated so no data point is lost, and at the same time, if a certain species is present at many locations, the species name will need to be repeated multiple times in the column. In the end, I will have a dataset that is 1000's of rows long, but only 5 columns in my header row.

Before before photo

After After photo

pirho
  • 11,565
  • 12
  • 43
  • 70
Andy
  • 413
  • 2
  • 15
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Feb 12 '18 at 15:49
  • Thank you for your quick response, & I did visit all of these forums, like the one you sent a link to, ,however as I mentioned, this question was difficult to verbalize, but easy to visualize. I appreciate your feedback, and will work on trying to make questions stated more clearly – Andy Feb 12 '18 at 15:54
  • Please `dput()` the data in the pictures. – Richard Border Feb 12 '18 at 15:58
  • Hello, thanks for your response. I am unaware of the dput() function, I will see how I can use this and update my post to make more sense. Thank you – Andy Feb 12 '18 at 16:01

2 Answers2

1

Here is a way to do it using base R:

# Notice that the link works if you include the /raw/ part
sample.data <-read.table("https://pastebin.com/raw/uAQD6nnM", header=T, sep="\t")

vars <- c("var0", "Var.1", "Coords_y", "Coords_x")

# Just selects the ones marked TRUE for each
alf <- sample.data[ sample.data$Alfaroa.williamsii, vars ]
aln <- sample.data[ sample.data$Alnus.acuminata, vars ]
alf$species <- "Alfaroa.williamsii"
aln$species <- "Alnus.acuminata"
final <- rbind(alf,aln)
final

var0 Var.1  Coords_y  Coords_x            species
192   191   7.10000 -73.00000 Alfaroa.williamsii
101   100 -13.18000 -71.59000 Alfaroa.williamsii
36    35  10.18234 -84.10683    Alnus.acuminata
38    37  10.26787 -84.05528    Alnus.acuminata

To do it more generally, using dplyr and tidyr, you can use the gather function:

library(dplyr)
library(tidyr)

tidyr::gather(sample.data, key = "species", value = "keep", 5:6) %>%
dplyr::filter(keep) %>% 
dplyr::select(-keep)

Just replace the 5:6 with the indices of the columns of different species.

C. Braun
  • 5,061
  • 19
  • 47
  • Hello Great! This is a step in the right direction. Thank you. My example here is with 2 species (Alfaroa.williamsii & Alnus.acuminata), however I have quite a few variables (100+ species) to do this on. Imagine I had posted 20 variables, found in rows 5:25 of the original data frame, Would there be a way to do this over a very large data frame without having to each time writing out the species name? Thanks a lot for your help – Andy Feb 12 '18 at 16:15
  • I have updated my answer with a more general solution. Hope this helps! – C. Braun Feb 12 '18 at 16:33
  • Hi. C. Braun . This works great, and I tried it over a very large dataset, and it is exactly what I was looking for. I find the code interesting because I was expecting an "if / else" command somewhere to filter FALSE and TRUE, but this code you wrote just does that, and I do not quite see why. Nonetheless, the problem is solved and I can move on with other analysis. Many thanks once again. – Andy Feb 13 '18 at 08:26
  • Hi @Andy, to explain a little more, the "if/else" happens in the `filter` function. When we first read in the data using `read.table`, the function internally calls `type.convert`, which makes all of the species columns of type logical (i.e. TRUE or FALSE instead of "TRUE" or "FALSE"). Then, the `filter` function keeps only the rows that evaluate to TRUE. An equivalent statement for filtering (after gathering) would be `dplyr::filter(keep == TRUE)`. – C. Braun Feb 13 '18 at 14:39
0

I could not download the data so I made some:

sample.data=data.frame(var0=c(192,36,38,101),var1=c(191,35,37,100),y=c(7.1,10.1,10.2,-13.8),x=c(-73,-84,-84,-71),
                       Alfaroa=c(T,F,F,T),Alnus=c(T,T,T,F))

the code that gives the requested result is:

dfAlfaroa=sample.data%>%filter(Alfaroa)%>%select(-Alnus)%>%rename("Species"="Alfaroa")%>%replace("Species","Alfaroa")
dfAlnus=sample.data%>%filter(Alnus)%>%select(-Alfaroa)%>%rename("Species"="Alnus")%>%replace("Species","Alnus")
rbind(dfAlfaroa,dfAlnus)
Antonios
  • 1,919
  • 1
  • 11
  • 18