
I have a txt file with 100,000+ lines of data. I want to turn it into a dataframe but do not need every line of data. An example of the data entry looks like this:

FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Yang, Qiang
   Liu, Yang
   Chen, Tianjian
   Tong, Yongxin
TI Federated Machine Learning: Concept and Applications
SO ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY
VL 10
IS 2
AR 12
DI 10.1145/3298981
DT Article
PD FEB 2019
PY 2019
AB Today's artificial intelligence still faces two major challenges (...) etc. 

I only want the rows that begin with TI, AU, PD or AB, extracted into correspondingly named columns. This is as far as I have gotten, and I am really struggling!

read.table("groupprojectdatabase.txt", header = FALSE, sep = ",", quote = "",
           dec = ".", numerals = c("allow.loss"),
           row.names = c("TI", "AU", "PB","AB"), col.names = c('title_col','author_col','date_col','summary_col'), as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = FALSE,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

Any help would be really appreciated, even if it is just which functions I need to look up or whether I am on the right track. I was thinking that the sep = argument is relevant, but I couldn't work out how to tell it to skip everything but the TI, AU, PD and AB rows.

In particular I am not sure how to program R to treat entire sentences as variables, not each word etc.

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 4 elements
Phil
  • Welcome to StackOverflow. Can you edit your post to make your question [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for others by creating a minimal example that reproduces the same error message? – jrcalabrese Dec 02 '22 at 17:35

1 Answer


I made a file test.txt based on your data above. After having some problems with read.table, I switched to readr::read_delim from the tidyverse.

With delim = ";" (a character that does not occur in the sample), each line of the file is read into a single column, and the first line becomes that column's name. Each line is then split at the first whitespace, i.e. after the two-letter tag.

Because the four author lines (the AU line and its indented continuation lines) belong together, the last part of the code below brings those lines together.

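library(tidyverse)

# ";" does not occur in the sample, so every line is read into a single column;
# the first line of the file becomes that column's name.
df <- read_delim("path_to_your/test.txt", delim = ";", col_names = TRUE)

ddf <- df |> 
  # split each line at the first space: two-letter tag vs. the remaining text
  separate(`FN Clarivate Analytics Web of Science`, 
           into = c("first", "rest"), 
           sep = " ", extra = 'merge') |> 
  # indented continuation lines have an empty tag; inherit it from the line above
  mutate(first = ifelse(first == "", NA, first)) |> 
  fill(first) |> 
  # concatenate all lines that share a tag and keep one row per tag
  group_by(first) |> 
  mutate(rest = paste0(rest, collapse = "")) |> 
  distinct(first, .keep_all = TRUE)
  
ddf |> 
  filter(first %in% c('TI', 'AU', 'PD', 'AB'))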

#> # A tibble: 4 × 2
#> # Groups:   first [4]
#>   first rest                                                            
#>   <chr> <chr>                                                           
#> 1 AU    Yang, Qiang  Liu, Yang  Chen, Tianjian  Tong, Yongxin           
#> 2 TI    Federated Machine Learning: Concept and Applications            
#> 3 PD    FEB 2019                                                        
#> 4 AB    Today's artificial intelligence still faces two major challenges
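
The code above groups by tag only, so it fits a file holding a single record. For a full export with many records, one possible extension is sketched below in base R. It is only a sketch under stated assumptions: every record is assumed to start with a PT line (as in the sample), continuation lines are assumed to be indented, and `parse_tagged` and its column names are hypothetical names chosen for illustration, not part of any package. The second record in the demo input is made up purely to show the multi-record case.

```r
# Hypothetical helper: turn tagged lines into one row per record.
parse_tagged <- function(lines,
                         keep = c(title = "TI", author = "AU",
                                  date = "PD", summary = "AB")) {
  lines <- lines[nzchar(trimws(lines))]        # drop blank lines
  is_cont <- startsWith(lines, " ")            # indented = continuation line
  # Index of the most recent tagged line (0 if none has been seen yet).
  last_tagged <- cummax(seq_along(lines) * !is_cont)
  # Continuation lines inherit the tag of the last tagged line.
  tag <- c(NA_character_, substr(lines, 1, 2))[last_tagged + 1L]
  value <- trimws(substring(lines, 3))         # text after the two-letter tag
  # Assumption: each record begins with a PT line.
  rec <- cumsum(!is_cont & tag == "PT")
  rec_ids <- unique(rec[rec > 0])
  out <- data.frame(record = rec_ids)
  for (col in names(keep)) {
    sel <- rec > 0 & tag %in% keep[[col]]
    # Join authors with "; ", wrapped text (title, abstract) with a space.
    sep <- if (keep[[col]] == "AU") "; " else " "
    joined <- vapply(split(value[sel], rec[sel]), paste, character(1),
                     collapse = sep)
    # Records lacking this tag get NA.
    out[[col]] <- unname(joined[as.character(rec_ids)])
  }
  out
}

# Demo input: the first record is abridged from the question,
# the second record is invented for illustration.
sample_lines <- c(
  "FN Clarivate Analytics Web of Science",
  "VR 1.0",
  "PT J",
  "AU Yang, Qiang",
  "   Liu, Yang",
  "TI Federated Machine Learning: Concept and Applications",
  "PD FEB 2019",
  "PT J",
  "AU Doe, Jane",
  "TI A second, made-up record",
  "PD MAR 2020"
)
res <- parse_tagged(sample_lines)
print(res)
```

For a real file, the same helper could be called as `parse_tagged(readLines("groupprojectdatabase.txt"))`. Reading with `readLines` keeps each line as one string, so no separator has to be chosen and whole sentences stay intact.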
MarBlo