
I want to read a large .txt file into R using the vroom package, because it is fast and supports pipe connections for pre-filtering.

For reproducibility, let's read this UK cats CSV file from the Tidy Tuesday project and pre-filter for tag_id == "Ares". The first column corresponds to the tag_id.

The following code returns an empty data frame. How can I fix the filter, and what changes are required to filter by a regular expression instead of == "Ares"?

cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"

vroom(
  file = pipe(paste("awk -F ',' '{ if ($1 == 'Ares') { print } }'", cats_file)),
  delim = ","
)
  • Are you downloading the file onto your local computer before using `awk`? Or are you trying to use `awk` with a URL? – MrFlick Aug 16 '23 at 17:32
  • To filter by regular expressions replace the string comparison operator `==` with the regexp comparison operator `~`, see the awk man page. By the way, `'{ if ($1 == "Ares") { print } }'` can be written as just `'$1 == "Ares"'`, condition testing is implicit and printing the current record is the default action. – Ed Morton Aug 16 '23 at 19:28

2 Answers


Inside an awk script, literal string values need to be wrapped in double quotes, eg:

($1 == "Ares")

Single quotes are used to delimit awk script/code; in this case awk sees 3 chunks of script/code:

  • { if ($1 == +
  • Ares +
  • ) { print }}

which the shell concatenates into:

{ if ($1 == Ares) { print } }

This translates into awk comparing $1 to whatever's in the variable named Ares, which in this case is undefined (aka the empty string), so $1 == <empty_string> fails and nothing is printed.
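
You can see this outside of vroom with a minimal sketch, runnable from R (assuming a Unix shell with awk on PATH; the two-line printf input is made up for illustration):

# the inner single quotes end and restart the awk program, so awk compares
# $1 to an empty variable; even a line whose first field is literally "Ares"
# produces no output
readLines(pipe("printf 'Ares,1\\nBella,2\\n' | awk -F ',' '{ if ($1 == 'Ares') { print } }'"))
#> character(0)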

I'm assuming you would need to escape the embedded double quotes, eg:

file = pipe(paste("awk -F ',' '{ if ($1 == \"Ares\") { print } }'", cats_file)),
                                           ^^    ^^

NOTE: I don't work with r/vroom so I'm assuming the rest of OP's code should work once the awk script is modified.


As Ed Morton has mentioned in the comments, the following should also work:

awk -F ',' '{ if ($1 ~ /Ares/) { print } }'

file = pipe(paste("awk -F ',' '{ if ($1 ~ /Ares/) { print } }'", cats_file)),

####
#### or
####

awk -F ',' '$1 ~ /Ares/'

file = pipe(paste("awk -F ',' '$1 ~ /Ares/`", cats_file)),

####
#### or
####

awk -F ',' '$1 == "Ares"'

file = pipe(paste("awk -F ',' '$1 == \"Ares\"'", cats_file)),
  • I really liked your 2nd suggestion, with `~ /regex/`. For some reason, this works on a toy file I just created, but not for the cats example. I wonder why. Another problem is that I'm skipping the table header. How can I fix this behaviour? – dzegpi Aug 16 '23 at 20:22
  • I was focused solely on explaining/fixing the current code; I wasn't focused on 'improving' the code (as Ed was), so while I know about `~ /Ares/` you can thank Ed for bringing it up in the comments; as for the issue you're having, for me `$1 ~ /Ares/` prints 103 lines from that cats file (and it skips the header line); if this code works sometimes but not others then I'd suggest asking a new question, making sure to provide a physical example of a file (eg, 5-10 lines) where the code does not work; also be sure to explain the header issue (ie, do you want to print the header or not?) – markp-fuso Aug 16 '23 at 21:13
  • I want to print the header. Did you get those 103 rows by using vroom? I think the cause of the problem might be that the dataset is remote, so awk or grep are not being executed on the plain text. – dzegpi Aug 18 '23 at 21:29
  • I downloaded (via `wget`) the cats file so that I was working on a local copy; `awk` doesn't know how to work on 'remote' files unless *a)* a copy is downloaded, *b)* you can somehow stream the file into `awk`'s stdin, or *c)* push the `awk` code to the remote host and run it there (likely not an option for a web target); I don't work with `r/vroom`; to print the header in any of the answers add `FNR==1;` at the beginning of the `awk` script (eg, `awk -F',' 'FNR==1; $1 ~ /Ares/'`) – markp-fuso Aug 18 '23 at 21:51
  • additional comment about the `$1 ~ /Ares/` answer ... that will match on any string that *contains* `'Ares'`, eg, it will match on `PreAres`, `Arespit` – markp-fuso Aug 18 '23 at 21:52
  • Thanks for your answers. I do not know how `vroom()` passes the data (if it does) to `awk` or `grep` using a pipe connection, and it does not seem to be a topic in the GitHub repository of vroom, so I [raised an issue](https://github.com/tidyverse/vroom/issues/508). – dzegpi Aug 19 '23 at 00:37

1) The file linked to is not so large that you need to do anything special. Try this.

library(dplyr)
library(vroom)

Ares <- cats_file |>
  vroom() |>
  filter(tag_id == "Ares") 

2) This also works:

Ares <- cats_file |>
  vroom() |>
  filter(grepl("^Ares$", tag_id))

3) This works too and uses no packages:

Ares <- cats_file |>
  read.csv() |>
  subset(tag_id == "Ares") 

4) Another approach if all the Ares rows are together (which it appears is the case) is to just read in the first column and then use that to pick out the rows needed.

# read only the first column: keep column 1, drop the remaining 10
tag_id <- read.csv(cats_file, colClasses = c(NA, rep("NULL", 10)))
# first and last data rows whose tag_id contains "Ares"
rng <- range(grep("Ares", tag_id$tag_id))
# re-read only that contiguous block of rows
Ares <- read.csv(cats_file, skip = rng[1] - 1, nrows = diff(rng) + 1)

An alternative, if we know that the tag_id field is first, is to replace the tag_id line above with

tag_id <- read.csv(cats_file, comment.char = ",")
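
To see why this works, here is a tiny made-up example: with "," as the comment character, read.csv discards everything from the first comma onward, so each line collapses to its first field.

# hypothetical 3-line CSV, for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("tag_id,event_id", "Ares,1", "Bella,2"), tmp)
read.csv(tmp, comment.char = ",")
#>   tag_id
#> 1   Ares
#> 2  Bella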
  • The linked file is meant to be used as an example. The skip approach is interesting. I wonder if it is more efficient than reading the pre-filtered file at once. – dzegpi Aug 16 '23 at 20:20
  • When I tried it, it seemed that just using plain vroom was faster. – G. Grothendieck Aug 16 '23 at 20:33