
I am trying to open and clean a massive oceanographic dataset in R, where station information is interspersed as headers in between the chunks of observations:

$
 2008    1  774  8 17  5 11  2   78.4952    6.0375 30  7    1.2 -999.0 -9 -9 -9 -9 4868.8 2017  0  7114
    2.0    6.0297   35.0199   34.4101    2.0 11111
    3.0    6.0279   35.0201   34.4091    3.0 11111
    4.0    6.0272   35.0203   34.4091    4.0 11111
    5.0    6.0273   35.0204   34.4097    4.9 11111
    6.0    6.0274   35.0205   34.4104    5.9 11111
$
 2008    1  777  8 17 12  7 25   78.4738    8.3510 27  6    4.1 -999.0  3  7  2  0 4903.8 1570  0  7114
    3.0    6.4129   34.5637   34.3541    3.0 11111
    4.0    6.4349   34.5748   34.3844    4.0 11111
    5.0    6.4803   34.5932   34.4426    4.9 11111
    6.0    6.4139   34.5624   34.3552    5.9 11111
    7.0    6.5079   34.6097   34.4834    6.9 11111

Each `$` is followed by a row containing station data (e.g. year, ..., lat, lon, date, time), followed by several rows containing the observations sampled at that station (e.g. depth, temperature, salinity, etc.).

I would like to add the station data to each observation, so that each variable is a column and each observation is a row, like this:

2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    2   6.0297  35.0199 34.4101 2   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    3   6.0279  35.0201 34.4091 3   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    4   6.0272  35.0203 34.4091 4   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    5   6.0273  35.0204 34.4097 4.9 11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    6   6.0274  35.0205 34.4104 5.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    3   6.4129  34.5637 34.3541 3   11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    4   6.4349  34.5748 34.3844 4   11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    5   6.4803  34.5932 34.4426 4.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    6   6.4139  34.5624 34.3552 5.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    7   6.5079  34.6097 34.4834 6.9 11111
  • Reasonable question. What have you tried and where have you failed? – Roman Luštrik Nov 14 '19 at 19:30
  • I have tried `read.table("file.txt", sep = "\t")` and `readLines("file.txt")`; both open the file, with each row as a single-character string. I haven't figured out a clever way to extract the rows with station data and add them as columns, in particular as the number of observations varies by station. – Larusson Nov 14 '19 at 19:57
  • Hard to tell without access to the data. Maybe they are not tab-separated? I would have guessed that your attempt with `read.table` would have errored out because of unequal line lengths. Reading in with `readLines` and then selecting only those lines starting with 2008 would be the way to proceed. There are many worked examples of this method on SO, but not very many would have a "tidy" tag. Here's one that I found: https://stackoverflow.com/questions/39085902/import-unusually-formatted-text-data-using-r/39087497#39087497 – IRTFM Nov 14 '19 at 20:47
  • Thanks, for the pointer. Sorry, the file is too large to post here. It does indeed throw an error message due to unequal line lengths if the `sep = "\t"` is not included. – Larusson Nov 14 '19 at 21:05
  • Replacing blanks with commas yields: `$ ,2008,,,,1,,774,,8,17,,5,11,,2,,,78.4952,,,,6.0375,30,,7,,,,1.2,-999.0,-9,-9,-9,-9,4868.8,2017,,0,,7114 ,,,,2.0,,,,6.0297,,,35.0199,,,34.4101,,,,2.0,11111 ,,,,3.0,,,,6.0279,,,35.0201,,,34.4091,,,,3.0,11111 ,,,,4.0,,,,6.0272,,,35.0203,,,34.4091,,,,4.0,11111` – Larusson Nov 14 '19 at 21:13
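The approach suggested in the comments can be sketched in a few lines. This is a minimal illustration using an inline copy of the sample above; it assumes the real file marks each new station with a line containing only `$` (using the marker is safer than matching a fixed year such as 2008):

```r
# Read the raw lines, then use the "$" marker lines to assign each
# observation line to its station block.
x <- readLines(textConnection(
"$
 2008    1  774  8 17  5 11  2   78.4952    6.0375 30  7    1.2 -999.0 -9 -9 -9 -9 4868.8 2017  0  7114
    2.0    6.0297   35.0199   34.4101    2.0 11111
    3.0    6.0279   35.0201   34.4091    3.0 11111"))

is_marker <- x == "$"           # TRUE where a new station block starts
block_id  <- cumsum(is_marker)  # running block index for every line
block_id
```

Once every line carries a block index, the header and observation lines within each block can be parsed and recombined, which is what both answers below do in their own way.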

2 Answers


This solution is pretty involved and rests on several tidyverse packages and features. I'm not sure how robust it is for your needs, but it handles the sample you posted. More importantly, the general approach of folding blocks of lines together, writing functions to parse each block, and then unfolding the results should serve you well.

The first piece involves finding the `$` markers, grouping the following lines together, and "nesting" each block of data. The result is a data frame with only a few rows: one for each section.

library(tidyverse)
txt_lns <- readLines("ocean-sample.txt") 

txt <- tibble(txt = txt_lns)

# Start by finding new sections, and nesting the data
nested_txt <- txt %>%
  mutate(row_number = row_number()) %>%
  mutate(new_section = str_detect(txt, "\\$")) %>%            # Mark new sections
  mutate(starting = ifelse(new_section, row_number, NA)) %>%  # Index with row num
  tidyr::fill(starting) %>%                                   # Fill index down
                                                              # where missing
  select(-new_section) %>%                                    # Clean up
  filter(!str_detect(txt, "\\$")) %>%                         
  nest(data = c(txt, row_number))                             # "Nest" the data

# Take a quick look
nested_txt

Then, we need to be able to deal with those nested blocks. The routines here parse each block by identifying its header row and separating the fields into data frames of their own, with different logic for the 22-field header rows and the 6-field observation rows.

# Deal with the records within a section
parse_inner_block <- function(x, header_ind) {
  if (header_ind) {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the header row into 22 variables
      separate(txt, into = LETTERS[1:22], sep = "\\s+")
  } else {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>% 
      # Separate the lesser rows into 6 variables
      separate(txt, into  = letters[1:6], sep = "\\s+")
  }
  return(df)
}

parse_outer_block <- function(x) {
  df <- x %>%
    # Determine if it's a header row with 22 variables or lesser row with 6
    mutate(leading_row = (row_number == min(row_number))) %>%
    # Fold by header row vs. not
    nest(data = c(txt, row_number)) %>%
    # Create data frames for both header and lesser rows
    mutate(processed = purrr::map2(data, leading_row, parse_inner_block)) %>%
    unnest(processed) %>%
    # Copy header row values to lesser rows
    tidyr::fill(A:V) %>%
    # Drop header row
    filter(!leading_row)
  return(df)
}

And then we can put it all together: starting with our nested data, processing each block, unnesting the fields that come back, and prepping the full output.

# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
  mutate(proc_out = purrr::map(data, parse_outer_block)) %>%
  select(-data) %>%
  unnest(proc_out) %>%
  select(-starting, -leading_row, -data, -row_number)

output 

Hope it helps. I'd recommend looking at some purrr tutorials as well for similar problems.
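The key purrr idiom in the answer above is `map2()`, which walks two parallel vectors or lists and applies a function pairwise; that is how `parse_inner_block` receives both a nested data frame and its header flag. A minimal, self-contained illustration of the same pattern (with made-up toy data, not the ocean dataset):

```r
library(purrr)

# Two small data frames and a parallel flag vector, standing in for the
# nested blocks and the leading_row indicator used in the answer.
dfs   <- list(data.frame(v = 1:2), data.frame(v = 3:4))
flags <- c(TRUE, FALSE)

# map2() applies the function to (dfs[[1]], flags[1]), (dfs[[2]], flags[2]), ...
out <- map2(dfs, flags, function(df, is_header) {
  df$header <- is_header  # tag each data frame with its flag
  df
})
```

`purrr::map()` is the one-input analogue used in the final `proc_out` step.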

ravic_

This is simpler and depends only on base R. I assume that you have read the text file with `x <- readLines(....)` first:

start <- which(x == "$") + 1             # Find header indices
rows <- diff(c(start, length(x)+2)) - 2  # Find number of lines per group
# Function to read header and rows and cbind
getdata <- function(begin, end) {
    cbind(read.table(text=x[begin]), read.table(text=x[(begin+1):(begin+end)]))
}
dta.list <- lapply(1:(length(start)), function(i) getdata(start[i], rows[i]))
dta.df <- do.call(rbind, dta.list)

This works with the two groups you included in your post. You will need to fix the column names, since V1 through V6 are repeated at the beginning and end.
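One quick way to repair the duplicated names is `make.unique()`; assigning descriptive names from your codebook is better, but the generic fix looks like this (shown on a toy two-column stand-in for `dta.df`, since `cbind()` on two `read.table()` results produces the same clashing `V1`, `V2`, ... names):

```r
# Toy stand-in: cbind() of two read.table() results duplicates V1, V2, V3.
dta.df <- cbind(read.table(text = "2008 1 774 78.4952 6.0375"),
                read.table(text = "2.0 6.0297 35.0199"))

# make.unique() deduplicates by appending .1, .2, ... to repeats.
names(dta.df) <- make.unique(names(dta.df))
names(dta.df)
```

After this, `dta.df$V1` refers to the first header field and `dta.df$V1.1` to the first observation field; renaming to real variable names (year, lat, depth, ...) is still recommended.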

dcarlson