
I have 100+ CSV files in a somewhat "random" format, from which I need to read only the sections related to Activities and Sleep. The problem is that across files there is no fixed row number where either of those sections starts, and the length of each section varies. In some cases a heading may be missing entirely, or be present but with no data under it.

To get the idea: [screenshot of one of the CSV files]

I searched Stack Overflow, and while reading from a particular row is not a problem, in this case that fixed row number is not available. How can I do this?

m45ha
  • If the sections you want to read always start with a particular heading present in the first column, then first just read the first column and find the starting points by searching for those headers. Then you will know how many rows to skip while reading for each section. – AdroMine May 27 '23 at 09:14
  • maybe these answers are helpful: https://stackoverflow.com/questions/58550614/r-importing-semi-unstructured-data-csv – I_O May 27 '23 at 10:12
  • A screenshot from Excel does not really define how the resulting (or imported) CSV is formatted, especially when Excel export can be affected by manual choices, Excel options and system localization settings. Please provide sample data through some reproducible method, i.e. the output of `dput(readLines("sample.csv"))` where *sample.csv* is one of the csv files. Perhaps add a few of those if the file layout is not consistent and headings are not present for some (which, btw, does not sound good at all). – margusl May 27 '23 at 10:17
  • example csv is here https://github.com/maria-pro/tutorials/blob/main/496.csv – m45ha May 28 '23 at 05:53
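The header-search idea from the first comment can be sketched as follows. This is a hedged, self-contained toy: `lines` stands in for `readLines("496.csv")`, and the heading names and blank-line convention are assumptions to adapt to the real files.

```r
# Toy stand-in for readLines("496.csv"); the assumption is that sections
# are separated by blank lines and each starts with a heading in column 1.
lines <- c("Body", "Date,Weight", "01-01-2023,73", "",
           "Activities", "Date,Steps", "01-01-2023,100", "",
           "Sleep", "Start Time,End Time", "01-01-2023 10pm,02-01-2023 6am")

first_col <- sub(",.*", "", lines)                 # text before the first comma
starts <- which(first_col %in% c("Activities", "Sleep"))

# a section ends just before the next blank line, or at the end of file
ends <- vapply(starts, function(i) {
  blank <- which(lines == "" & seq_along(lines) > i)
  if (length(blank)) min(blank) - 1L else length(lines)
}, integer(1))

# parse each section as CSV, skipping its heading line
sections <- Map(function(s, e) read.csv(text = lines[(s + 1):e]),
                starts, ends)
names(sections) <- first_col[starts]
```

Once `starts`/`ends` are known, the same skip/n_max logic works with any CSV reader.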

2 Answers


We could first parse the file with meltr to identify the starting and ending rows of each section, and then extract the relevant parts from the file.

library(dplyr)
library(tidyr)
library(readr)
library(meltr)
library(purrr)
library(stringr)

csv_ <- "https://raw.githubusercontent.com/maria-pro/tutorials/main/496.csv"

# get row/col/type for each token, empty rows are included as NA values
(melted <-  melt_csv(csv_))
#> # A tibble: 358 × 4
#>      row   col data_type value     
#>    <dbl> <dbl> <chr>     <chr>     
#>  1     1     1 character Body      
#>  2     2     1 character Date      
#>  3     2     2 character Weight    
#>  4     2     3 character BMI       
#>  5     2     4 character Fat       
#>  6     3     1 character 26-06-2022
#>  7     3     2 integer   73        
#>  8     3     3 double    29.62     
#>  9     3     4 integer   50        
#> 10     4     1 character 27-06-2022
#> # ℹ 348 more rows

# Keep only rows with no more than 1 column (including empty rows);
# get all records before the first "Food Log" section;
# fill NAs in value, recode data_type to start/end, pivot wider; 
# adjust start_row and end_row to match the header and last non-empty line of
# the section
csv_sections <- melted %>% 
  filter(max(col) == 1, .by = row) %>% 
  # handle files without "Food Log" entries by appending a fake record
  add_row(row = Inf, value = "Food Log end of file marker") %>% 
  filter(row < row[str_detect(value,"Food Log") %>% which.max()]) %>% 
  fill(value) %>% 
  mutate(data_type = case_match(data_type, "character" ~ "start_row", "missing" ~ "end_row" )) %>% 
  pivot_wider(names_from = data_type, values_from = row) %>% 
  mutate(start_row = start_row + 1,
           end_row =   end_row - 1) %>% 
  select(section = value, start_row, end_row)

csv_sections
#> # A tibble: 4 × 3
#>   section    start_row end_row
#>   <chr>          <dbl>   <dbl>
#> 1 Body               2       9
#> 2 Foods             12      19
#> 3 Activities        22      29
#> 4 Sleep             32      37

# read the CSV as a vector of lines
csv_lines <- read_lines(csv_)
# cycle through csv_sections and parse subsets of csv_lines as CSV;
# I() makes read_csv() recognize the input as literal data
sections <- pmap(csv_sections, 
                 \(section, start_row, end_row) read_csv(I(csv_lines[start_row:end_row]), 
                                                         show_col_types = FALSE)) %>% 
  set_names(csv_sections$section)

Results:

sections$Activities
#> # A tibble: 7 × 10
#>   Date       `Calories Burned` Steps Distance Floors `Minutes Sedentary`
#>   <chr>                  <dbl> <dbl>    <dbl>  <dbl>               <dbl>
#> 1 26-06-2022              1411     0     0         0                1440
#> 2 27-06-2022              1411     0     0         0                1440
#> 3 28-06-2022              1411     0     0         0                1440
#> 4 29-06-2022              2128  8292     5.38     27                1291
#> 5 30-06-2022              1678  1016     0.66      9                 488
#> 6 01-07-2022              1634  1078     0.7      10                 790
#> 7 02-07-2022              1537   732     0.47      7                 846
#> # ℹ 4 more variables: `Minutes Lightly Active` <dbl>,
#> #   `Minutes Fairly Active` <dbl>, `Minutes Very Active` <dbl>,
#> #   `Activity Calories` <dbl>
sections$Sleep
#> # A tibble: 5 × 9
#>   `Start Time`        `End Time`         `Minutes Asleep` `Minutes Awake`
#>   <chr>               <chr>                         <dbl>           <dbl>
#> 1 01-07-2022 8:26 pm  02-07-2022 5:35 am              473              76
#> 2 01-07-2022 6:53 am  01-07-2022 9:48 am              155              20
#> 3 30-06-2022 8:58 pm  01-07-2022 3:38 am              341              59
#> 4 30-06-2022 11:45 am 30-06-2022 4:00 pm              225              30
#> 5 29-06-2022 10:03 pm 30-06-2022 8:19 am              524              92
#> # ℹ 5 more variables: `Number of Awakenings` <dbl>, `Time in Bed` <dbl>,
#> #   `Minutes REM Sleep` <chr>, `Minutes Light Sleep` <chr>,
#> #   `Minutes Deep Sleep` <chr>

Created on 2023-05-28 with reprex v2.0.2
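Since the question mentions 100+ files, the single-file pipeline above can be wrapped in a function and mapped over all paths. A hedged sketch: `parse_file()` is a hypothetical placeholder for the `melt_csv()`/`pivot_wider()`/`read_csv()` steps shown above, and `"data"` is an assumed directory.

```r
# Hypothetical wrapper around the pipeline above; here it just returns
# the path so the sketch is runnable without the real files.
parse_file <- function(path) {
  # ... melt_csv(path), build csv_sections, pmap() over the sections ...
  path
}

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # assumed dir
all_sections <- setNames(lapply(files, parse_file), basename(files))
```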

margusl
  • Thank you! It nearly gets there, but when it hits a file with "missing" sections it gets buggy, e.g. when other sections are missing in this https://github.com/maria-pro/tutorials/blob/main/454.csv or this https://github.com/maria-pro/tutorials/blob/main/488.csv. – m45ha May 29 '23 at 00:23
  • I am thinking, since sections are separated by blank lines, would it be a solution to convert the csvs to Excel and split them at blank lines? – m45ha May 29 '23 at 00:24
  • It relies on detecting a Food Log entry; a lazy fix is just to append one in code. You can try whether the updated answer works for you. By the way, the structure and content hint that those files may have been converted from JSON; if that's the case, you could skip the whole csv-excel thing and work with those JSONs in R. And if it happens to be Fitbit data, you could just use http://matthewrkaye.com/fitbitr/ . – margusl May 29 '23 at 09:43
  • I was not reading this carefully! Dumb me! – m45ha Jun 08 '23 at 09:02
  • Can I give you a double tick? :) and a hug! and a kiss! :) Thank you for the help! Such a beautiful solution! – m45ha Jun 08 '23 at 09:03
  • Nice to hear it worked for you. – margusl Jun 08 '23 at 11:25
  • Can I ask for help one more time with this code? It works magic for some files, e.g. https://github.com/maria-pro/tutorials/blob/main/496.csv, but not for others, e.g. https://github.com/maria-pro/tutorials/blob/main/036.csv. They look identical to me, so why? And how to fix it? – m45ha Jun 12 '23 at 10:14
  • @m45ha, those two are quite different in structure - https://i.stack.imgur.com/3qToi.png . One is a ragged file, rows with a varying number of fields, and sections separated by empty lines. The other is a rectangular csv, where each row includes a fixed number of fields (9) which may or may not be empty. Different line endings, too. If I had to guess, `036.csv` is the result of an Excel csv export. I would advise against comparing those in Excel; instead use a text editor that can display non-printing characters (e.g. Notepad++), or press `.` in github to launch the vs code editor to see the different layouts. – margusl Jun 12 '23 at 11:11
  • An easy fix would be to skip any preprocessing in Excel. It's easier and a lot faster to come up with a new answer from scratch than to modify an existing one to accept files like `036.csv`, though the one by @I_O might actually work for those cases. – margusl Jun 12 '23 at 11:20
  • This is how I got them :( I did not save them in Excel myself; apparently that happened before me :( – m45ha Jun 12 '23 at 21:33
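For rectangular files like `036.csv`, where every row has the same number of fields and the separator rows are just rows of empty strings, a different split is needed. A hedged sketch with toy inline data standing in for the real file; the assumption is that each section's first non-empty row holds its name in column 1.

```r
# Toy stand-in for a rectangular export like 036.csv: fixed field count,
# "blank" separator rows are rows of empty fields.
txt <- "Activities,,\nDate,Steps,\n01-01,100,\n,,\nSleep,,\nStart,End,\n10pm,6am,"
df <- read.csv(text = txt, header = FALSE,
               blank.lines.skip = FALSE, colClasses = "character")

empty <- apply(df == "", 1, all)       # TRUE for separator rows
grp <- cumsum(empty)[!empty]           # section id for each non-empty row
chunks <- split(df[!empty, ], grp)
names(chunks) <- vapply(chunks, function(x) x[1, 1], character(1))
```

Each element of `chunks` can then be re-parsed with its first row as the section name and its second row as the header.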

TL;DR: You can try the following:

library(terra)
library(purrr)

the_data <- read.csv('path/to/your/file.csv', header = FALSE,
                     blank.lines.skip = FALSE ## !
)

the_raster <- the_data |> 
  as.matrix() |>
  (\(x) ifelse(x == '', NA, x))() |>
  rast()

the_chunks <- patches(the_raster)

get_chunk_index  <-  function(m){
  matrix(m, nrow(m), ncol(m), byrow = TRUE) |>
    apply(MARGIN = 1 , FUN = \(row) paste0('chunk_',unique(row)[1]))
}

data_chunks <-  the_data |> split(f = list(get_chunk_index(the_chunks)))

## grab activity data:
keep(data_chunks, ~ any(grepl('Activities', .x)))

how it works

  • read in your data with your csv-reader of choice
  • convert the resulting dataframe to a matrix (ensuring that empty cells of the originating spreadsheet are converted to NA)
  • convert the matrix to a raster and use a package for image/spatial analysis (here: {terra}) to find patches of contiguous non-empty cells (the analog of Ctrl+A in spreadsheet software)
  • reconvert the raster of patches (= data chunks) to a matrix and extract the vector of patch numbers each row belongs to (or NA for empty lines)
  • split your initial data by this index vector, and extract/process the desired chunks from the resulting list.
  1. get data. Remember to set your csv-importer to keep blank lines (as those are used to distinguish between data blocks).
the_data <- read.csv('path/to/your/file.csv', header = FALSE,
                     blank.lines.skip = FALSE
)
  2. convert to matrix, define NAs (here: empty strings), convert to spatial raster, inspect:
the_raster <- the_data |> 
  as.matrix() |>
  (\(x) ifelse(x == '', NA, x))() |>
  rast()

plot(the_raster)

[plot: raster from matrix]

  3. find contiguous non-NA raster ranges (think Ctrl+A), inspect:
the_chunks <- patches(the_raster)
plot(the_chunks)

[plot: contiguous data chunks]

  4. helper function to extract the first non-NA row value (= chunk index) from a matrix:
get_chunk_index  <-  function(m){
  matrix(m, nrow(m), ncol(m), byrow = TRUE) |>
    apply(MARGIN = 1 , FUN = \(row) paste0('chunk_',unique(row)[1]))
}
  5. split the initial dataframe by chunk index and get a list of data chunks:
data_chunks <-  the_data |> split(f = list(get_chunk_index(the_chunks)))
data_chunks
# > data_chunks
$chunk_1
         V1        V2        V3
1 Some junk Some junk Some junk
2 Some junk Some junk Some junk
3 Some junk Some junk Some junk

$chunk_2
           V1 V2 V3
8  Activities      
9          A1 A2 A3
10          2  1  3
11          3  4  6
# ...
  6. proceed with the desired chunks. Example: pick any chunk containing "Activities":
library(purrr) ## for convenient list operations
keep(data_chunks, ~ any(grepl('Activities', .x)))
$chunk_2
           V1 V2 V3
8  Activities      
9          A1 A2 A3
10          2  1  3
11          3  4  6

I used this example structure:

[screenshot: example of semi-structured data]

I_O
  • unfortunately, it did not work... it did remove the blanks but kept everything... – m45ha May 28 '23 at 05:51
  • m45ha, thanks for pointing this out. Setting your csv import to keep blank lines works with the example data you provided. Try edited answer plz. – I_O May 29 '23 at 06:43