R importing semi-unstructured data CSV

Question

I am used to loading a straight csv with the column headers and a single table into R, I have a large csv file that has the following structure:

+-----------+---------+--------+---------+--------+---------+
| file_name |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| table1    |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| Var1      | Var2    | Var3   | Var4    | Var5   | Var6    |
+-----------+---------+--------+---------+--------+---------+
| 198824    | 198824  | 198824 | 198824  | 198824 | 198824  |
+-----------+---------+--------+---------+--------+---------+
| 123       | 1234    | 1242   | 124     | 1241   | 1232    |
+-----------+---------+--------+---------+--------+---------+
|           |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
|           |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| file_name |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| table2    |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| Var1      | Var2    | Var3   | Var4    | Var5   | Var6    |
+-----------+---------+--------+---------+--------+---------+
| x         | x       | x      | x       | x      | x       |
+-----------+---------+--------+---------+--------+---------+
| y         | y       | y      | y       | y      | y       |
+-----------+---------+--------+---------+--------+---------+
| z         | z       | z      | z       | z      | z       |
+-----------+---------+--------+---------+--------+---------+
|           |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
|           |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| file_name |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| table3    |         |        |         |        |         |
+-----------+---------+--------+---------+--------+---------+
| Var1      | Var2    | Var3   | Var4    | Var5   | Var6    |
+-----------+---------+--------+---------+--------+---------+
| 532523    | 25235   | 532523 | 25235   | 532523 | 25235   |
+-----------+---------+--------+---------+--------+---------+
| 25332     | 5325235 | 25332  | 5325235 | 25332  | 5325235 |
+-----------+---------+--------+---------+--------+---------+

The data is not completely unstructured in that it follows this pattern:
The first row only has the file name: file_name
The 2nd row has the table: table1, table2, table3, etc.
And the the actual table itself, i.e. 6 columns from var1 to var6 with the data below it.
Then there are 2 empty rows and the next set would start with the file_name repeating itself and then the next table number and the table within it

All the tables within the CSV follows this pattern, but I am having trouble even loading this into R, I get the following when loading it straight using read.csv():

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names

Is this possible to load into one dataframe using R and also have the table number become a column, and var1-var6 + table number as the column headers?

i.e.

+--------+---------+--------+---------+--------+---------+--------------+
|  Var1  |  Var2   |  Var3  |  Var4   |  Var5  |  Var6   | table_number |
+--------+---------+--------+---------+--------+---------+--------------+
| 198824 | 198824  | 198824 | 198824  | 198824 | 198824  | table1       |
| 123    | 1234    | 1242   | 124     | 1241   | 1232    | table1       |
| x      | x       | x      | x       | x      | x       | table2       |
| y      | y       | y      | y       | y      | y       | table2       |
| z      | z       | z      | z       | z      | z       | table2       |
| 532523 | 25235   | 532523 | 25235   | 532523 | 25235   | table3       |
| 25332  | 5325235 | 25332  | 5325235 | 25332  | 5325235 | table3       |
+--------+---------+--------+---------+--------+---------+--------------+

Note that the number of rows for each table (table1, table2, etc.) have different number of rows.

The CSV file has around 200 tables total and is over the Excel limit (I think around 9MM rows)

Using Brian's Code, here are the first few lines:

> lines_all
 [1] "name,,,,,"                      "table1,,,,,"                    "Var1,Var2,Var3,Var4,Var5,Var6"  "321,54312,321,54654,3564,54321"
 [5] "45,54,4564,54,87,456"           ",,,,,"                          ",,,,,"                          "name,,,,,"                     
 [9] "table2,,,,,"                    "Var1,Var2,Var3,Var4,Var5,Var6"  "ssvf,afs,fasf,afsaf,zxvz,zvx"   "saf,zvx,zz,z,zxvz,zxvzxv"      
[13] "zxvsaf,wr,wrw,afsaf,asf,af"     ",,,,,"                          ",,,,,"                          "name,,,,,"                     
[17] "table3,,,,,"                    "Var1,Var2,Var3,Var4,Var5,Var6"  "1,2,3,4,5,6"                    "7,8,9,10,11,12"                
[21] "13,14,15,16,17,18"              "19,20,21,22,23,24"

Brian · Accepted Answer · 2019-10-25T20:32:56.637

Using this file:

file_name
table1
Var1, Var2, Var3, Var4, Var5, Var6
198824, 198824, 198824, 198824, 198824, 198824
123, 1234, 1242, 124, 1241, 1232





file_name
table2
Var1, Var2, Var3, Var4, Var5, Var6
x, x, x, x, x, x
y, y, y, y, y, y
z, z, z, z, z, z





file_name
table3
Var1, Var2, Var3, Var4, Var5, Var6
532523, 25235, 532523, 25235, 532523, 25235
25332, 5325235, 25332, 5325235, 25332, 5325235

First read everything in as character vectors.

library(readr)
library(stringr)
library(purrr)
library(dplyr)
# Could be done in base R, but {readr} will be faster on a large file

# read in all lines
lines_all <- read_lines("nested_tables.txt")
lines_all
#>  [1] "file_name"                                     
#>  [2] "table1"                                        
#>  [3] "Var1, Var2, Var3, Var4, Var5, Var6"            
#>  [4] "198824, 198824, 198824, 198824, 198824, 198824"
#>  [5] "123, 1234, 1242, 124, 1241, 1232"              
#>  [6] ""                                              
#>  [7] ""                                              
#>  [8] ""                                              
#>  [9] ""                                              
#> [10] ""                                              
#> [11] "file_name"                                     
#> [12] "table2"                                        
#> [13] "Var1, Var2, Var3, Var4, Var5, Var6"            
#> [14] "x, x, x, x, x, x"                              
#> [15] "y, y, y, y, y, y"                              
#> [16] "z, z, z, z, z, z"                              
#> [17] ""                                              
#> [18] ""                                              
#> [19] ""                                              
#> [20] ""                                              
#> [21] ""                                              
#> [22] "file_name"                                     
#> [23] "table3"                                        
#> [24] "Var1, Var2, Var3, Var4, Var5, Var6"            
#> [25] "532523, 25235, 532523, 25235, 532523, 25235"   
#> [26] "25332, 5325235, 25332, 5325235, 25332, 5325235"

Search each string for a regex match for your table name. You may need to adjust the matching pattern: "table[0-9]" to match your actual names.

# find where there's a string like "table1"
table_id_indices <- str_detect(lines_all, "table[0-9]")
table_id_indices
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23]  TRUE FALSE FALSE FALSE

# extract the table names in order
table_id_names <- lines_all[table_id_indices]
table_id_names
#> [1] "table1" "table2" "table3"

Now that you've got a vector of strings, and the indices where each ID starts, you can split the vector up.

# split the vector of lines into a list of vectors
# `cumsum` is a handy trick to "fill" from one TRUE value to the next
lines_chunked <- split(lines_all, cumsum(table_id_indices))
lines_chunked
#> $`0`
#> [1] "file_name"
#> 
#> $`1`
#>  [1] "table1"                                        
#>  [2] "Var1, Var2, Var3, Var4, Var5, Var6"            
#>  [3] "198824, 198824, 198824, 198824, 198824, 198824"
#>  [4] "123, 1234, 1242, 124, 1241, 1232"              
#>  [5] ""                                              
#>  [6] ""                                              
#>  [7] ""                                              
#>  [8] ""                                              
#>  [9] ""                                              
#> [10] "file_name"                                     
#> 
#> $`2`
#>  [1] "table2"                            
#>  [2] "Var1, Var2, Var3, Var4, Var5, Var6"
#>  [3] "x, x, x, x, x, x"                  
#>  [4] "y, y, y, y, y, y"                  
#>  [5] "z, z, z, z, z, z"                  
#>  [6] ""                                  
#>  [7] ""                                  
#>  [8] ""                                  
#>  [9] ""                                  
#> [10] ""                                  
#> [11] "file_name"                         
#> 
#> $`3`
#> [1] "table3"                                        
#> [2] "Var1, Var2, Var3, Var4, Var5, Var6"            
#> [3] "532523, 25235, 532523, 25235, 532523, 25235"   
#> [4] "25332, 5325235, 25332, 5325235, 25332, 5325235"

To make the lines readable, remove any non-table lines.

# remove lines that don't have commas, since they're not tables
lines_chunked_cleaned <- map(lines_chunked, ~str_subset(.x, ",")) %>% compact()
lines_chunked_cleaned
#> $`1`
#> [1] "Var1, Var2, Var3, Var4, Var5, Var6"            
#> [2] "198824, 198824, 198824, 198824, 198824, 198824"
#> [3] "123, 1234, 1242, 124, 1241, 1232"              
#> 
#> $`2`
#> [1] "Var1, Var2, Var3, Var4, Var5, Var6"
#> [2] "x, x, x, x, x, x"                  
#> [3] "y, y, y, y, y, y"                  
#> [4] "z, z, z, z, z, z"                  
#> 
#> $`3`
#> [1] "Var1, Var2, Var3, Var4, Var5, Var6"            
#> [2] "532523, 25235, 532523, 25235, 532523, 25235"   
#> [3] "25332, 5325235, 25332, 5325235, 25332, 5325235"

Now each element of the list can be read in as a CSV.

# read in each vector of lines as a CSV
# forcing a default col_type prevents binding errors later
lines_chunked_csvs <- map(lines_chunked_cleaned, ~read_csv(.x, col_types = cols(.default = "c")))
lines_chunked_csvs
#> $`1`
#> # A tibble: 2 x 6
#>   Var1   Var2   Var3   Var4   Var5   Var6  
#>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 198824 198824 198824 198824 198824 198824
#> 2 123    1234   1242   124    1241   1232  
#> 
#> $`2`
#> # A tibble: 3 x 6
#>   Var1  Var2  Var3  Var4  Var5  Var6 
#>   <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 x     x     x     x     x     x    
#> 2 y     y     y     y     y     y    
#> 3 z     z     z     z     z     z    
#> 
#> $`3`
#> # A tibble: 2 x 6
#>   Var1   Var2    Var3   Var4    Var5   Var6   
#>   <chr>  <chr>   <chr>  <chr>   <chr>  <chr>  
#> 1 532523 25235   532523 25235   532523 25235  
#> 2 25332  5325235 25332  5325235 25332  5325235

Use the names from before to identify each dataframe and bind them.

# name the list of tables, bind everything together
bind_rows(set_names(lines_chunked_csvs, table_id_names), .id = "table")
#> # A tibble: 7 x 7
#>   table  Var1   Var2    Var3   Var4    Var5   Var6   
#>   <chr>  <chr>  <chr>   <chr>  <chr>   <chr>  <chr>  
#> 1 table1 198824 198824  198824 198824  198824 198824 
#> 2 table1 123    1234    1242   124     1241   1232   
#> 3 table2 x      x       x      x       x      x      
#> 4 table2 y      y       y      y       y      y      
#> 5 table2 z      z       z      z       z      z      
#> 6 table3 532523 25235   532523 25235   532523 25235  
#> 7 table3 25332  5325235 25332  5325235 25332  5325235

Thank you, I am having trouble on the part where you are doing the lines_chunked_cleaned part, it doesn't seem to be removing the empty rows, I think what is happening is that since my file is a csv file, in the "Lines_chunked" part it actually creates rows of ",,,,," instead of "" 's like you have — Lulumocha, Oct 28 '19 at 19:58
Also in the "table_id_names <- lines_all[table_id_indices]" part, instead of #> [1] "table1" "table2" "table3" like you have, I get [1] "table1,,,,," "table2,,,,," "table3,,,,," — Lulumocha, Oct 28 '19 at 20:03
@Lulumocha, I used my best guess at your actual formatting. Since this question hinges on those details, it would help if you could post some examples of the first few lines. — Brian, Oct 28 '19 at 20:06
so sorry my mistake, I have edited my post to add in the data that comes from reading in the data using read_lines(), I think it is realizing that it does have 6 columns, so it is bring in "file_name,,,,," please see my example on my post I edited — Lulumocha, Oct 29 '19 at 00:54

R importing semi-unstructured data CSV

1 Answers1

Linked