R system call to awk fail

Question

I have a log file, let's call it mylogfile.txt

Format is date-timestamp, then semicolon delimiter, then some other stuff that I am, for the purposes of this exercise, unconcerned with.

eg (this is all one line in the log file - not sure how to present as such in SO so apologies)

20170710-23:59:43.158;B@13.43434@1000000.0@20170710-21:15:53.23@@2017071023:59:43.158@@T@20170710-23:59:43.156#B@13.41834@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#A@13.47274@1000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#A@13.48874@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#

What I am currently attempting is simply a proof of concept example. I wish to parse the file, reverse the row order, and return two columns in the output -

1) Just the timestamp parsed from column one (which is a date-time format so I need to discard the date portion)

2) That timestamp expressed in seconds since midnight, to millisecond precision (in line with the granularity of the timestamps themselves).

So from the single-line example above, the output would be e.g.

23:59:43.158,86383.158
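To spell out the arithmetic behind that second column, here is a minimal sketch in R. The helper name `secs_since_midnight` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: convert "HH:MM:SS.mmm" strings to seconds since midnight.
secs_since_midnight <- function(ts) {
  parts <- strsplit(ts, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    # hours, minutes and seconds weighted by 3600, 60 and 1
    3600 * p[1] + 60 * p[2] + p[3]
  }, numeric(1))
}

ts <- "23:59:43.158"
# 23 * 3600 + 59 * 60 + 43.158 = 86383.158
line <- sprintf("%s,%.3f", ts, secs_since_midnight(ts))
print(line)
```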

I can get halfway there. I can construct a call to awk using syntax which works perfectly well within Cygwin (stripped of the R wrapper, naturally). But it doesn't work within R.

testawk<-paste0("tac ", mylogfile.txt, " | awk 'BEGIN {FS=\"-|;|:\"} {OMFT=\"%.3f\"} {print $2 \":\" $3 \":\" $4 \",\" (3600*$2)+(60*$3)+$4}' ")

getawk<-as.data.frame(system(testawk, intern=TRUE, show.output.on.console = FALSE))

However what ends up in the data frame getawk is simply the raw log file churning through as it's being read. Plus I get the warning message that running command had status 1.

HOWEVER

if I strip out the 'tac' piece and just use straight awk, thus;

testawk<-paste0("awk 'BEGIN {FS=\"-|;|:\"} {OMFT=\"%.3f\"} {print $2 \":\" $3 \":\" $4 \",\" (3600*$2)+(60*$3)+$4}' ", mylogfile.txt)

getawk<-as.data.frame(system(testawk, intern=TRUE, show.output.on.console = FALSE))

I get the error message

Error in system(testawk, intern = TRUE, show.output.on.console = FALSE) : 'awk' not found

I don't think the problem is in my awk construction, as it works fine if I simply run it within Cygwin. So there's clearly some facet of the R / system / awk interaction that I am not fully grasping.

I imagine if I wrapped this all up in an awk script and simply called the script it may work, but I am frustrated that I can't simply find the right syntax to invoke awk directly with the R system command (I handle grep, sed commands etc that way ok).

It's not as simple as awk not actually being supported at all is it?

Pointers greatly appreciated. If the first say 20 lines of the logfile would be useful I can post those too.

Why do you want to use `awk` for this? You can do it in R without much pain. – nicola, Aug 01 '17 at 12:38

mlegge · Accepted Answer · 2017-08-01T13:56:55.033


This often happens when trying to use other languages with R, e.g. Python. If you haven't added the paths to your Windows system path then you haven't told RStudio where to find the executables.

The root of Cygwin is normally found at C:\cygwin64 (but could vary by your installation), so find the install and look for the bin folder. In there should be the awk executable, but it is normally just a symlink to a gawk executable (verify yourself), so add the bin folder to the PATH and call gawk instead of awk. Note that PATH entries are folders and that Windows separates them with a semicolon, e.g.:

Sys.setenv(PATH = paste("C:/cygwin64/bin", Sys.getenv("PATH"), sep = ";"))

NOTE: This change is not permanent, so you must run it at the start of each session, or add the folder to your Windows PATH to have it recognized permanently.
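A rough sketch of the per-session setup, assuming Cygwin is installed at C:/cygwin64 (yours may differ). `prepend_path` is a made-up helper, not an R function; it uses `.Platform$path.sep` so the separator matches the operating system, and `Sys.which()` to check whether gawk can now be found.

```r
# Hypothetical helper: put a folder at the front of a PATH-style string.
prepend_path <- function(dir, path = Sys.getenv("PATH"),
                         sep = .Platform$path.sep) {
  paste(dir, path, sep = sep)
}

cygwin_bin <- "C:/cygwin64/bin"  # assumption: adjust to your own install

if (dir.exists(cygwin_bin)) {
  Sys.setenv(PATH = prepend_path(cygwin_bin))
  # Sys.which() returns "" when the executable is not found on the PATH.
  print(Sys.which("gawk"))
}
```

Putting these lines in a startup script would save retyping them each session.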

edited Aug 01 '17 at 13:56

answered Aug 01 '17 at 12:37

mlegge


/usr/bin/awk was returned. I will add that. Thanks (and also to the poster below). – Pascoe Aug 01 '17 at 12:52
Sadly didn't work. Still getting awk not found error. As I am running all this on a windows box, do I need a more 'windowey' file path? – Pascoe Aug 01 '17 at 13:19
ok thanks. Turns out it was in c:/installs/cygwin64/bin/awk so I tried running the command using that. Still no joy. Trying an RStudio restart. – Pascoe Aug 01 '17 at 13:46
did you use a capital C? You will also need to run the command every time you restart RStudio as this doesn't permanently add to the PATH – mlegge Aug 01 '17 at 13:51
Ahh ok. Well success from a different angle. I looked at the awk executable and it seemed to merely be a link to a gawk one. SO I tried invoking the R command using gawk instead and that seems to work fine. In terms of needing to run the command every time, I guess I can easily enough add it to my startup script that is a part of my usual routine. So thanks very much for the help. Got there in the end. – Pascoe Aug 01 '17 at 13:54
DOH - I speed read your amendment mlegge and spotted the path but genuinely NOT that you had also solved the gawk problem. I am very sorry - looks like I was trying to pass your success off as my own. Honestly not the case. Just trying to do a zillion things at once. Help really greatly appreciated. – Pascoe Aug 01 '17 at 14:03
@Pascoe, I made that edit to recognize your solution, it was your discovery – mlegge Aug 01 '17 at 14:10
Ahhhhh - well ok then. First useful thing I've posted on SO. Great. Might go home now - my work here is done ;-) – Pascoe Aug 01 '17 at 14:15

score 1 · Answer 2 · answered Aug 01 '17 at 12:42


Sounds like 'awk' is simply not found; maybe it's not in your PATH. Try putting in the full path to awk, e.g. '/usr/bin/awk'. I'm not using Windows and Cygwin, so your real path will certainly be different.
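A sketch of the full-path approach, under some assumptions: the location of gawk.exe and the log file name below are placeholders to adjust. Writing the awk program to a temporary file and passing it with `-f` sidesteps the nested quoting between R, the Windows command line and awk, and `printf` makes the three decimal places explicit.

```r
gawk    <- "C:/cygwin64/bin/gawk.exe"  # assumption: full path on your machine
logfile <- "mylogfile.txt"             # assumption: log file in the working directory

# awk program: split on "-", ";" or ":" and print the time plus seconds since midnight
awk_prog <- c(
  'BEGIN { FS = "[-;:]" }',
  '{ printf "%s:%s:%s,%.3f\\n", $2, $3, $4, 3600 * $2 + 60 * $3 + $4 }'
)
script <- tempfile(fileext = ".awk")
writeLines(awk_prog, script)

if (file.exists(gawk) && file.exists(logfile)) {
  out <- system2(gawk, c("-f", shQuote(script), shQuote(logfile)), stdout = TRUE)
  # rev() reverses the row order inside R, standing in for tac
  print(rev(out))
}
```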

answered Aug 01 '17 at 12:42

Habakuk


score 0 · Answer 3 · answered Aug 01 '17 at 13:09

Just do it all in R:

c(
  "20170710-10:31:26.121;B@13.43434@1000000.0@20170710-21:15:53.23@@2017071023:59:43.158@@T@20170710-23:59:43.156#B@13.41834@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#A@13.47274@1000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#A@13.48874@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#",
  "20170710-23:59:43.158;B@13.43434@1000000.0@20170710-21:15:53.23@@2017071023:59:43.158@@T@20170710-23:59:43.156#B@13.41834@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#A@13.47274@1000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#A@13.48874@4000000.0@20170710-21:15:53.23@@20170710-23:59:43.158@@T@20170710-23:59:43.156#"
) -> log_lines

# you'd get the above with `log_lines <- readLines('filename')`

library(magrittr)  # provides the %>% pipe used below

matched <- stringi::stri_match_first_regex(log_lines, "([[:digit:]]+:[[:digit:]]+:[[:digit:]]+\\.[[:digit:]]+)")[,2]

cat(
  rev(
    sprintf(
      "%s,%s\n", 
      matched, 
      lubridate::hms(matched) %>% 
        as.numeric() %>% 
        sprintf("%9.3f", .)
    )
  ),
  sep=""
)

That makes:

10:31:26.121,37886.121
23:59:43.158,86383.158

and, you can cat to a file or store that in a data frame (etc).
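For the data frame route, a minimal base-R sketch with no extra packages. The sample lines are shortened stand-ins for the real log lines, and the column names `time` and `seconds` are arbitrary choices.

```r
# Shortened sample lines; real ones would come from readLines().
log_lines <- c(
  "20170710-10:31:26.121;B@13.43434@1000000.0",
  "20170710-23:59:43.158;B@13.43434@1000000.0"
)

# Keep the text between the first "-" and the first ";" as the timestamp.
stamp <- sub("^[0-9]+-([^;]+);.*$", "\\1", log_lines)

# Split "HH:MM:SS.mmm" and weight the pieces by 3600, 60 and 1.
parts <- do.call(rbind, lapply(strsplit(stamp, ":", fixed = TRUE), as.numeric))
secs  <- as.vector(parts %*% c(3600, 60, 1))

result <- data.frame(time = stamp, seconds = secs, stringsAsFactors = FALSE)
result <- result[rev(seq_len(nrow(result))), ]  # reverse the row order
print(result)
```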

I grok that awk might be more familiar to you, but it makes absolutely no sense to use it.

Thanks. I do know how to do this stuff in R, but the question was specifically how to call awk. As I stated earlier, that particular example was merely that. An example. I have other use cases also. — Pascoe, Aug 01 '17 at 13:17
