0

I want to analyze some currency trading data from a website, but the data are only accessible via copy-and-paste. I copy it to my computer's clipboard and import it into R via:

#df <- read.table("clipboard", header = FALSE, sep = "\t", stringsAsFactors = FALSE, na.strings = "", fill = TRUE)

But the data frame drops a single observation into four rows when it's read into R:

df <- structure(list(V1 = c("Buy", "Completed", "Fee1.00 USD", "Total199.00 USD", "Buy", "Completed", "Fee0.50 USD", "Total100.00 USD", "Buy", "Completed", "Fee0.64 USD", "Total127.00 USD"), V2 = c(NA, "2021-02-11 20:49:19", NA, NA, NA, "2021-02-11 20:48:03", NA, NA, NA, "2021-02-11 20:47:22", NA, NA), V3 = c(NA, "0.11057", NA, NA, NA, "82.146", NA, NA, NA, "30.15", NA, NA)), row.names = c(NA, 12L), class = "data.frame")
df

#               V1                  V2      V3
#1              Buy                <NA>    <NA>
#2        Completed 2021-02-11 20:49:19 0.11057
#3      Fee1.00 USD                <NA>    <NA>
#4  Total199.00 USD                <NA>    <NA>
#5              Buy                <NA>    <NA>
#6        Completed 2021-02-11 20:48:03  82.146
#7      Fee0.50 USD                <NA>    <NA>
#8  Total100.00 USD                <NA>    <NA>
#9              Buy                <NA>    <NA>
#10       Completed 2021-02-11 20:47:22   30.15
#11     Fee0.64 USD                <NA>    <NA>
#12 Total127.00 USD                <NA>    <NA>

I'd like to therefore collapse every series of four rows into one, like this, that overwrites the missing values generated as a quirk of the data importation process:

want <- structure(list(V1 = structure(c(1L, 1L, 1L), .Label = "Buy", class = "factor"), V2 = structure(c(1L, 1L, 1L), .Label = "Completed", class = "factor"), V3 = structure(3:1, .Label = c("2/11/2021 20:47", "2/11/2021 20:48", "2/11/2021 20:49"), class = "factor"), V4 = c(0.11057, 82.146,     30.15), V5 = structure(c(3L, 1L, 2L), .Label = c("Fee0.50 USD", "Fee0.64 USD", "Fee1.00 USD"), class = "factor"), V6 = structure(c(3L, 1L, 2L), .Label = c("Total100.00 USD", "Total127.00 USD", "Total199.00 USD"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
want

#   V1        V2        V3            V4      V5              V6
#1 Buy Completed 2/11/2021 20:49  0.11057 Fee1.00 USD Total199.00 USD
#2 Buy Completed 2/11/2021 20:48 82.14600 Fee0.50 USD Total100.00 USD
#3 Buy Completed 2/11/2021 20:47 30.15000 Fee0.64 USD Total127.00 USD

Obviously, things would still be a bit messy as I'd then need to split up some strings into separate columns (e.g. df$V5 = "Fee1.00 USD" would become df$Fee = 1.00), but that's a different issue.

I've tried adding an id variable and then reshaping from long to wide, as discussed here, but that gets even messier by taking values I need (e.g. the 1.00 in "Fee1.00 USD") and putting them as new column names:

df$id <- gl((nrow(df)/4), 4)
reshape(df, timevar = "V1", idvar = "id", direction = "wide")

And I've tried splitting the data frame into a list of data frames, as discussed here, but I'm still not sure how to collapse each one and stitch it back together:

split(df, f = df$id)

What is the best way to get the data into proper format?

coip
  • 1,312
  • 16
  • 30

2 Answers2

3

How about this:

library(dplyr)
library(tidyr)
df <- df %>% mutate(obs = rep(1:(nrow(.)/4), each=4))
df <- df %>% 
  pivot_longer(-obs, names_to="var", values_to="vals") %>% 
  na.omit() %>% 
  group_by(obs) %>% 
  mutate(col = seq_along(obs)) %>% 
  select(obs, col, vals) %>% 
  pivot_wider(names_from="col", names_prefix="V", values_from="vals")
df
# # A tibble: 3 x 7
# # Groups:   obs [3]
#     obs V1    V2        V3                  V4      V5          V6             
#   <int> <chr> <chr>     <chr>               <chr>   <chr>       <chr>          
# 1     1 Buy   Completed 2021-02-11 20:49:19 0.11057 Fee1.00 USD Total199.00 USD
# 2     2 Buy   Completed 2021-02-11 20:48:03 82.146  Fee0.50 USD Total100.00 USD
# 3     3 Buy   Completed 2021-02-11 20:47:22 30.15   Fee0.64 USD Total127.00 USD 
DaveArmstrong
  • 18,377
  • 2
  • 13
  • 25
0

Dave's answer works perfectly and is nicely concise. If for some reason someone can't use external packages, I used his script as a guide and tried to replicate it using base R:

df$id <- gl((nrow(df) / 4), 4)
df <- reshape(df, idvar = "id",
              v.names = "val",
              timevar = "var",
              times = names(df[1:3]),
              varying = names(df[1:3]),
              new.row.names = 1:1000,
              direction = "long")
df <- na.omit(df)
df <- df[order(df$id),]
df$col <- ave(seq_len(nrow(df)), df$id, FUN = seq_along)
df <- subset(df, select = c("id", "col", "val"))
df <- reshape(df, timevar = "col",
              idvar = "id",
              direction = "wide")
colnames(df) <- c("id", "V1", "V2", "V5", "V6", "V3", "V4")
varnames <- c("id", "V1", "V2", "V3", "V4", "V5", "V6")
df <- df[, varnames]
df
coip
  • 1,312
  • 16
  • 30