I have thousands of text files, each with tens of thousands of lines, and the lines differ in structure. The following 3 lines are an example:

DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#
DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16

A # normally marks the break between the name of a field and its value. Sometimes there is a deeper level where # changes to ° and = changes to *. The lines in the original data have about 10,000 characters each. In each line I am only searching for REFERENZ, which can appear multiple times, e.g. in line 1.

The result of the read function for these 3 lines should be a data.frame like this:

> Daten = data.frame(REFERENZ = c(23,24,25))
> str(Daten)
'data.frame':   3 obs. of  1 variable:
$ REFERENZ: num  23 24 25

Does anybody know a function in R which can search for this?

T. Beige
  • Does this existing question help? https://stackoverflow.com/questions/12626637/read-a-text-file-in-r-line-by-line. As long as you have enough memory it doesn't seem like a problem to just read all the lines with `readLines()` and then use a regular expression to extract the data you want. – MrFlick May 06 '22 at 13:14

1 Answer

I use the read_lines() function from the readr package for problems like this.

library(readr)
library(data.table)

t1 <- read_lines('textfile.txt')
table <- fread(paste0(t1, collapse = '\n'), sep = '#')

EDIT: I misunderstood the question, my bad. I think you are looking for a regular expression (regex).

library(readr)
library(stringr)

t1 <- 'DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#
DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16'
t1 <- read_lines(t1)

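# Take the first REFERENZ entry of each line (separator '=' or '*'),
# keep only its digits and convert them to numeric
Daten = data.frame(REFERENZ = as.numeric(str_extract(str_extract(t1, 'REFERENZ\\W\\d*'), '[0-9]+')))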
str(Daten)
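
If you would rather avoid extra packages, the same idea can be sketched in base R and applied to many files. This is only a rough sketch under some assumptions: the helper names extract_referenz() and read_referenz() are made up for illustration, the files are assumed to be UTF-8, only the first REFERENZ per line is kept (as in the expected output in the question), and lines without a REFERENZ become NA.

```r
# Hypothetical helper: first REFERENZ value of each line, NA if there is none.
extract_referenz <- function(lines) {
  # "REFERENZ", then '=' or '*' (the deeper level), then the digits
  m <- regexpr("REFERENZ[=*][0-9]+", lines)
  value <- rep(NA_real_, length(lines))
  # regmatches() returns the matched text only for lines that have a match
  value[m > 0] <- as.numeric(sub("REFERENZ[=*]", "", regmatches(lines, m)))
  value
}

# Hypothetical helper: read every file line by line and stack the results.
read_referenz <- function(files) {
  values <- lapply(files, function(f) {
    extract_referenz(readLines(f, warn = FALSE, encoding = "UTF-8"))
  })
  data.frame(REFERENZ = as.numeric(unlist(values)))
}

# Small demo: the three sample lines from the question in a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#",
  "DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#",
  "DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16"
), tmp)

Daten <- read_referenz(tmp)
str(Daten)
```

For the real data you could build the file vector with list.files(), for example list.files("path/to/files", pattern = "\\.txt$", full.names = TRUE), where the directory name is just a placeholder. If you need every REFERENZ of a line instead of the first one, gregexpr() in place of regexpr() should give you all matches.
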
gokhan can