I need to extract information from text files whose structure varies from file to file. This can be done with a macro, but because the files vary, selecting by line number and by spacing within a line does not work for all files.

I was wondering whether there is a way of parsing txt files, searching by keyword, and extracting the information after the keyword. For example, given something like Flow Rate: 99.99, I would want to extract the 99.99. Another issue is that, using the Flow Rate example, Flow Rate appears numerous times in each file. Is there a way to alias/index Flow Rate: so that I can select, say, the third occurrence?

Any hints or tips would be welcome. I know how to print the entire line when a keyword is identified, but not how to deal with multiple occurrences, or how to select only the number after the keyword:

all_data = readLines("Unit 5 2013.txt")
hours_of_operation <- grep("Annual Hours of Operation:    ",all_data)
all_data[hours_of_operation]
[1] "    Annual Hours of Operation:    8760.0 hours/yr"

Thanks

J

Cyrus
squishy
  • Does combining `grep` and `sub` or `regexpr` not work? Those alone can give you a vector (per file), from which you can arbitrarily choose the third (or other) element programmatically. – r2evans Feb 18 '15 at 20:12
  • @JThomp: I am curious to know if the answers helped you to find a solution to your problem? – Ruthger Righart Feb 23 '15 at 10:37
  • @RuthgerRighart Sorry for the delay - this is a side project to speed up processes. Thank you for your information. However, while this gives me the ability to select and index lines when they appear, I am struggling to select the numbers from the string because the size of the numbers is very variable. Pre-selecting the number of decimal places requires foreknowledge of the values. The other issue is that there are tables with columns within the file where I need to extract only one value from a row. – squishy Mar 05 '15 at 17:08
  • This issue is now resolved by expanding the selection [0-9]{1,9}.[e-e0-9]{1,9}[+-][0-9]{1,9}. I was worried this might cause it to extract unwanted information from the next word in the string, but apparently it only applies to the discrete string I'm after. Thanks for your help! – squishy Mar 05 '15 at 20:42

2 Answers

I am guessing that you have one data point on each line that you want to parse. If so, you can read the data into a character vector and use the grepl() function to find all elements of the vector that contain what you need.

So for example you have the data:

lhr: time to departure 5:00
dfw: time to arrival 4:40
jfk: time to arrival 5:50
dfw: time to departure 6:00
lax: time to departure 6:00

If you want to pull out the "dfw: " entries, you can do

data = readLines("file.txt")
data[grepl("dfw: ", data)]

And if you want the second of these entries, you do

data[grepl("dfw: ", data)][2]
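To go from the matching lines to the number itself, one option is to combine grep with regexpr and regmatches. This is only a minimal sketch: the sample lines and the helper name extract_after are made up for illustration, and it assumes every line containing the keyword has a plain decimal number right after it.

```r
# Hypothetical sample lines standing in for readLines("file.txt")
txt <- c("Unit A",
         "  Flow Rate: 99.99 kg/s",
         "  Temperature: 350 K",
         "  Flow Rate: 12.5 kg/s",
         "  Flow Rate: 0.003 kg/s")

# extract_after() is an illustrative helper, not part of any package.
# It returns the first number found after `keyword` on each matching line.
extract_after <- function(txt, keyword) {
  hits <- grep(keyword, txt, fixed = TRUE, value = TRUE)
  pos <- regexpr(keyword, hits, fixed = TRUE)
  # keep only the text that follows the keyword
  start <- as.integer(pos) + attr(pos, "match.length")
  rest <- substring(hits, start)
  # assumption: a number is present; non-matching lines would be dropped
  as.numeric(regmatches(rest, regexpr("-?[0-9]+\\.?[0-9]*", rest)))
}

flow <- extract_after(txt, "Flow Rate:")
flow[3]  # value from the third occurrence
```

Because the result is an ordinary numeric vector, any occurrence can be picked by index in the same way as above.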
Allen Wang

The following may help. I assume that you have read your text into one or more character vectors.

Data example

Note: If "Flow Rate" is capitalized in your files, you may want to apply tolower(ex) first.

ex<-c("The annual observed flow rate: 99.99")

Regexpr & Regmatches

Here regexpr searches for a number with one or two digits before and after the decimal point. The period is escaped so that it matches a literal "." rather than any character.

res<-regmatches(ex, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ex))
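When the size of the numbers is not known in advance, fixed digit counts such as {1,2} can be too narrow. The following is a sketch of a more permissive pattern, not a definitive one: it assumes the first number on each line is the one wanted, and the sample lines and the name num_pattern are made up for illustration.

```r
# Hypothetical lines with numbers of different sizes and formats
ex <- c("Annual Hours of Operation:    8760.0 hours/yr",
        "Flow Rate: 1.25e+03 kg/s",
        "The annual observed flow rate: 99.99")

# optional sign, digits with an optional decimal part, optional exponent
num_pattern <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"

# regexpr returns the first match per element; regmatches extracts it
res <- as.numeric(regmatches(ex, regexpr(num_pattern, ex)))
```

Converting with as.numeric also takes care of scientific notation, so no separate handling of the exponent is needed.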

Using position parameters

Another way to do it is to use the package cwhmisc. This solution searches for the start position of the word "rate". In the example string the number begins 6 positions after that ("rate" is followed by a colon and a space), so you can then extract it with substr.

library(cwhmisc)
A<-cpos(ex,"rate", start=1) #position of "rate" in the string
res<-substr(ex, start=A+6, stop=A+10) #the five characters "99.99"
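If installing an extra package is not an option, the same position-based idea can be sketched in base R with regexpr. This assumes, as above, that the number directly follows "rate: " and is exactly five characters wide; the variable names are illustrative only.

```r
ex <- "The annual observed flow rate: 99.99"

# position of the first occurrence of "rate" (base R, no extra package)
A <- regexpr("rate", ex, fixed = TRUE)[1]

# the number starts right after "rate: "
start <- A + nchar("rate: ")
# assumption: the number is five characters wide
res <- substr(ex, start = start, stop = start + 4)
```

Fixed offsets like these break as soon as the width of the number changes, which is why the regular-expression approach is usually more robust.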

If flow rate appears multiple times

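Split the elements of the vector into substrings and capture the numbers as before.

ex<-c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
ndat<-unlist(strsplit(ex, "flow"))
# pieces without a number (here the first one) are dropped by regmatches
res<-regmatches(ndat, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ndat))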
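An alternative that avoids splitting is gregexpr, which returns every match in a string rather than only the first. A minimal sketch, assuming each value directly follows "flow rate:" and is a plain decimal number; the variable names are illustrative.

```r
ex <- "The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22"

# match the keyword together with the number that follows it
pattern <- "flow rate: *[0-9]+\\.?[0-9]*"
hits <- regmatches(ex, gregexpr(pattern, ex))[[1]]

# strip the keyword, leaving one numeric value per occurrence
values <- as.numeric(sub("flow rate: *", "", hits))
values[3]  # value from the third occurrence
```

Including the keyword in the pattern keeps unrelated numbers elsewhere in the string from being picked up.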
Ruthger Righart