I have a .txt file with 100,000+ lines of data. I want to turn it into a dataframe, but I do not need every line. An example data entry looks like this:
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Yang, Qiang
Liu, Yang
Chen, Tianjian
Tong, Yongxin
TI Federated Machine Learning: Concept and Applications
SO ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY
VL 10
IS 2
AR 12
DI 10.1145/3298981
DT Article
PD FEB 2019
PY 2019
AB Today's artificial intelligence still faces two major challenges (...) etc.
I only want the rows that begin with TI, AU, PD, and AB, and to extract them into correspondingly named columns. This is as far as I have gotten, and I am really struggling!
read.table("groupprojectdatabase.txt", header = FALSE, sep = ",", quote = "",
           row.names = c("TI", "AU", "PD", "AB"),
           col.names = c("title_col", "author_col", "date_col", "summary_col"),
           stringsAsFactors = FALSE)
Any help would be really appreciated, even if it is just which functions I need to look up, or whether I am on the right track. I thought the sep = argument was relevant, but I couldn't work out how to tell it to skip everything except the TI, AU, PD, and AB rows.
In particular, I am not sure how to get R to treat an entire line as one value, rather than splitting it into individual words.
This is the error I currently get:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 4 elements
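For what it's worth, read.table expects rectangular, delimited data, which is why it fails here: each Web of Science record is a stack of tagged lines, not rows of columns. A minimal sketch of a different approach using readLines plus substr, so each whole line stays as one value (the column names come from the call above; treating "PT" as the record-start tag, and skipping the unprefixed continuation lines such as the extra authors, are assumptions on my part):

```r
## Sketch, assuming each record begins with a "PT " line and the field
## tag is the first two characters of each line. Continuation lines
## (e.g. the second and later authors) are skipped here.
parse_wos <- function(lines,
                      tags = c(TI = "title_col", AU = "author_col",
                               PD = "date_col", AB = "summary_col")) {
  tag <- substr(lines, 1, 2)
  rec <- cumsum(tag == "PT")              # record number for every line
  keep <- tag %in% names(tags) & rec > 0  # wanted tags inside a record
  out <- data.frame(matrix(NA_character_, nrow = max(rec, 0),
                           ncol = length(tags)),
                    stringsAsFactors = FALSE)
  names(out) <- tags
  for (i in which(keep)) {
    # drop the "XX " prefix; the rest of the line is the value, so a
    # full sentence lands in one cell rather than one word per cell
    out[rec[i], tags[[tag[i]]]] <- substring(lines[i], 4)
  }
  out
}

# For the real file:
# df <- parse_wos(readLines("groupprojectdatabase.txt"))
```

With 100,000+ lines this loop should still be fast enough, since it only touches the kept lines; grepl or regmatches would be alternatives worth looking up too.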