
I need to read a .txt file from a URL, but would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:

"

... (number of rows)

... (number of rows)

... (number of rows)

DATE VALUE

1945-01-01 144855

1946-01-01 138515

1947-01-01 136405

1948-01-01 135486

1949-01-01 142455

"

I would like to skip all rows until the row with "DATE VALUE" and start importing the data from that line onwards (including the "DATE VALUE" row). Is there a way to do this with data.table's fread(), or any other way, such as with dplyr?

Thank you very much in advance for your effort and your time!

Best,

c.

  • Possible duplicate of [Read csv from specific row](https://stackoverflow.com/questions/6592219/read-csv-from-specific-row) – Bea Jun 12 '17 at 20:00
  • Thank you, Bea, for pointing out a possible solution. Unfortunately, in this case the first few rows contain a lot of text, so the following error message is displayed: "Error in fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt") : Expecting 2 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=':' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed." – cthulhukk Jun 12 '17 at 20:05
  • To be more specific, due to the unstructured nature of the rows above, R won't recognise that "DATE" and "VALUE" are in fact two distinct columns if everything is read, displaying: "Error in fread("https://fred.stlouisfed.org/data/HNOCEAA027N.txt") : Expecting 2 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE [...]" – cthulhukk Jun 12 '17 at 20:14
  • Using `DT <- fread('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt', skip = 35)` works for me. – Jaap Jun 12 '17 at 20:28
  • Thanks, Jaap. For this specific case it does, but I have a number of different links with varying rows above, but always the DATE/VALUE row as the first one I would like to read. So, a generic way to identify the DATE/VALUE row would be helpful. Sorry, should have stated that in my question. – cthulhukk Jun 12 '17 at 20:36
  • do you have another link? – Jaap Jun 12 '17 at 20:41
  • Yes, for example: https://fred.stlouisfed.org/data/DGS10.txt and https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt. – cthulhukk Jun 12 '17 at 20:46
  • Could you determine the `skip` value with:`raw <- readLines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", n=-1)` and `start <- intersect(grep("DATE",raw),grep("VALUE",raw))` – CPak Jun 12 '17 at 20:55
  • Thank you very much, Chi Pak! Works great with the following code now: library(data.table); raw <- readLines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", n = -1); start <- intersect(grep("DATE", raw), grep("VALUE", raw)); mydata <- fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", skip = start - 1); head(mydata). So, thanks again! – cthulhukk Jun 12 '17 at 21:11
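
Putting CPak's suggestion and the follow-up comment together, the approach can be wrapped into a small generic helper (a sketch; the name read_fred is just illustrative, not from the thread):

library(data.table)

# Locate the "DATE ... VALUE" header line and let fread start reading there
read_fred <- function(url) {
  raw   <- readLines(url)
  start <- intersect(grep("DATE", raw), grep("VALUE", raw))[1]  # first matching line
  fread(url, skip = start - 1)
}

mydata <- read_fred("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")
head(mydata)

This downloads each file twice (once for readLines, once for fread), which is fine for files of this size.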

3 Answers


Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.

library(tidyverse)
library(stringr)

df <- data_frame(lines = read_lines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")) %>%
  filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
  mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
         value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
  select(-lines)

df
#> # A tibble: 286 x 2
#>          date value
#>         <chr> <dbl>
#>  1 1945-10-01  1245
#>  2 1946-01-01    NA
#>  3 1946-04-01    NA
#>  4 1946-07-01    NA
#>  5 1946-10-01  1298
#>  6 1947-01-01    NA
#>  7 1947-04-01    NA
#>  8 1947-07-01    NA
#>  9 1947-10-01  1413
#> 10 1948-01-01    NA
#> # ... with 276 more rows

I filtered for all the lines you want to keep using stringr::str_detect, then extracted out the info you want from the string using stringr::str_extract and regexes.
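
Since the comments mention several such files, the same pattern can be mapped over a vector of URLs, for example with purrr (a sketch; the value regex is widened to [\\d.-]+ so decimal series such as DGS10 parse correctly, and the "." placeholders become NA with a harmless coercion warning):

library(tidyverse)
library(stringr)

urls <- c("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt",
          "https://fred.stlouisfed.org/data/DGS10.txt",
          "https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt")

# read each file, keep only the data rows, and stack them with a series id
all_df <- map_df(set_names(urls, basename(urls)), ~ {
  data_frame(lines = read_lines(.x)) %>%
    filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
    mutate(date  = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
           value = as.numeric(str_extract(lines, "[\\d.-]+$"))) %>%
    select(-lines)
}, .id = "series")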

Julia Silge
  • Thanks to you too, analysisparalysis! That will give me something to think about...8) Never used the packages you are using, but will figure it out! Thanks again! – cthulhukk Jun 12 '17 at 21:16

Combining fread with unix tools:

> fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'")
           DATE   VALUE
  1: 1945-10-01    1245
  2: 1946-01-01       .
  3: 1946-04-01       .
  4: 1946-07-01       .
  5: 1946-10-01    1298
 ---                   
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792
> 
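One follow-up: FRED uses "." to mark missing observations, which is why fread leaves VALUE as a character column here. Passing fread's na.strings argument maps those entries to NA so the column should come in as numeric, e.g.:

library(data.table)

# "." is FRED's marker for a missing observation; read it as NA
DT <- fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'",
            na.strings = ".")
DT[, DATE := as.Date(DATE)]  # optional: parse the dates as Date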
Clayton Stanley
  • Very nice & clean solution! To generalize it for reading multiple files at once, you can do: `lapply(file.names, function(x) fread(paste0("curl -s ", x, " | sed -n -e '/^DATE.*VALUE/,$p'")))` – Jaap Jun 13 '17 at 19:38
  • @Jaap You could wrap that in an rbindlist – Clayton Stanley Jun 13 '17 at 20:30
  • @Jaap And you might try sprintf rather than paste; I find separating the template string from the vars inserted into it to be more readable than the paste approach. – Clayton Stanley Jun 14 '17 at 00:12
  • Nvm @Jaap thought you were the OP; you likely know all this – Clayton Stanley Jun 14 '17 at 00:22
  • np, `sprintf` is indeed a good alternative; with regard to binding everything together, I've explained it more extensively [here](https://stackoverflow.com/questions/32888757/reading-multiple-files-into-r-best-practice/32888918#32888918) – Jaap Jun 14 '17 at 07:05
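
Putting these comments together, a sketch for reading and stacking several files this way (assuming curl and sed are available, and using na.strings to keep VALUE numeric) could look like:

library(data.table)

file.names <- c("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt",
                "https://fred.stlouisfed.org/data/DGS10.txt",
                "https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt")

# sprintf keeps the command template separate from the URL inserted into it
cmd <- "curl -s %s | sed -n -e '/^DATE.*VALUE/,$p'"
DT.list <- lapply(file.names, function(x) fread(sprintf(cmd, x), na.strings = "."))
names(DT.list) <- basename(file.names)

# one long table, with an id column saying which series each row came from
DT.all <- rbindlist(DT.list, idcol = "series")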

Using:

library(data.table)

file.names <- c('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt',
                'https://fred.stlouisfed.org/data/DGS10.txt',
                'https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt')

# read each file once and find the line number of the "DATE VALUE" header
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1

# option 1: base R, returns data.frames
l <- Map(function(x, y) read.table(text = x, skip = y, header = TRUE),
         x = text.list, y = skip.rows)

# option 2: data.table, returns data.tables
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))

will get you a list of data.frames (option 1) or data.tables (option 2).
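
If you need to tell the resulting tables apart, you can additionally name the list after the FRED series IDs taken from the URLs (a small addition to the code above, not part of the original answer):

# name each list element after its series ID, e.g. "DGS10"
names(l) <- sub("\\.txt$", "", basename(file.names))
l[["DGS10"]]   # access a single series by name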

Jaap
  • Neat solution, Jaap! Thank you very much for your help and effort! Will have to think about the code a bit in order to properly understand it, though...8) Unfortunately I'm only just starting with R. I appreciate the helpful community! Hat tip to Chi Pak, as well! – cthulhukk Jun 12 '17 at 21:13