I have a text file that contains multiple iterations (blocks) of data in the format below. I need to extract selected fields from each block and put them into a tibble/data frame. I think I'll need some sort of function that scans the data and places each value into the relevant column. I have never dealt with this kind of data format and have no idea how to do it. How can I do it?

Source file format
==============
Col1: X
objectClass: top
objectClass: Role
Col2: X1
Col2: X2
User: UserX
User: userY
User: UserZ
cn: P1
description: Permissions
Host: HRef456

Col1: Y
objectClass: top
objectClass: Role
Col2: Y3
Col2: Y4
Col2: Y5
User: U1
User: U2
cn: P2
description: Permissions
Host: HRef123

What I need:
===========

(image of the expected output; not reproduced here)

CT_369
  • I *think* `read.dcf` should do it - try `read.dcf("filename.txt", all=TRUE)` – thelatemail Aug 11 '20 at 01:50
  • Thanks, but I think I would need some sort of a function for this. – CT_369 Aug 11 '20 at 02:17
  • `read.dcf` is a function. It imports the data into a data.frame. Your data is also uneven, which is going to cause issues here I think. E.g. the first block has 2 `Col2` entries and 3 `User` entries while the reverse is true of the second block. – thelatemail Aug 11 '20 at 02:17
  • Yeah, wondering how to handle this dynamically. I'm not an expert with R functions. :( – CT_369 Aug 11 '20 at 02:24
  • It's not really possible to represent uneven data in a rectangular dataset. I.e. - if there are 3 `Col2`'s in a single block and 2 `User`s, which one goes with which? The `read.dcf` function I mentioned will import the data into a dataset, but you'll have embedded lists, and not a clean format like your requested output. – thelatemail Aug 11 '20 at 03:20
  • a combination of read.dcf and [this](https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows) – rawr Aug 11 '20 at 03:34
  • CT_369, I think you need to update your expected output to account for the uneven data that @thelatemail mentioned. For example, `as.data.frame(read.dcf("63350291.dcf", all=TRUE))` "works", but it produces `structure(list(Col1="X",objectClass=list(c("top","Role")),Col2=list(c("X1","X2")),User=list(c("UserX","userY","UserZ")),cn="P1",description="Permissions",Host="HRef456"),row.names=1L,class="data.frame")` (1 row), which cannot "recycle" cleanly. – r2evans Aug 11 '20 at 03:36
  • @rawr - they're not comma-separated strings, they're embedded `list()` objects. – thelatemail Aug 11 '20 at 03:44
  • @thelatemail failing to see why that is a problem `x[] <- lapply(x, function(y) toString(unlist(y)))` – rawr Aug 11 '20 at 03:59
  • @rawr - not a problem, just that there'd be another step: `read.dcf`, conversion to strings, then split comma-separated. `tidyr::unnest` or something similar might be more direct. – thelatemail Aug 11 '20 at 04:02
  • @rawr When I execute the code below to first load the data, it errors out: `as.data.frame(read.dcf("sample.txt", all=TRUE))`
    `Error in readLines(file, skipNul = TRUE) : cannot open the connection` In addition: `Warning message: In readLines(file, skipNul = TRUE) : cannot open compressed file '63350291.txt', probable reason 'No such file or directory'`
    – CT_369 Aug 11 '20 at 07:45
  • @CT_369 - `'No such file or directory'` - your text file doesn't exist or you're working in a different directory to where the file is saved. – thelatemail Aug 12 '20 at 01:51
  • @thelatemail This is magical. Thank you so much. When I load the actual data, I end up with multiple comma-separated values in the same cell. How do I address that? I mean I want them in different columns so that I can then unnest all the values and have a single column. – CT_369 Aug 12 '20 at 06:42
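
The approach discussed in the comments can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not an answer taken from the thread: it assumes the file contains only `key: value` lines with blank lines between blocks, it recreates the sample from the question in a temporary file, and it reshapes the result of `read.dcf(..., all = TRUE)` into a long table with one row per value. The column names `record`, `field`, and `value` are invented for illustration. Because every value gets its own row, the question of which `Col2` belongs with which `User` never has to be answered.

```r
# Sample blocks copied from the question, written to a temporary file.
records <- c(
  "Col1: X", "objectClass: top", "objectClass: Role",
  "Col2: X1", "Col2: X2",
  "User: UserX", "User: userY", "User: UserZ",
  "cn: P1", "description: Permissions", "Host: HRef456",
  "",
  "Col1: Y", "objectClass: top", "objectClass: Role",
  "Col2: Y3", "Col2: Y4", "Col2: Y5",
  "User: U1", "User: U2",
  "cn: P2", "description: Permissions", "Host: HRef123"
)
path <- tempfile(fileext = ".txt")
writeLines(records, path)

# all = TRUE keeps every occurrence of a repeated field; fields that repeat
# within a block come back as list columns.
raw <- read.dcf(path, all = TRUE)

# Build one row per (record, field, value). unlist() flattens both plain
# character cells and list cells, so both column types are handled alike.
long <- do.call(rbind, lapply(seq_len(nrow(raw)), function(i) {
  do.call(rbind, lapply(names(raw), function(f) {
    vals <- unlist(raw[[f]][i], use.names = FALSE)
    data.frame(record = i, field = f, value = vals,
               stringsAsFactors = FALSE)
  }))
}))

# Keep only the fields of interest (here, hypothetically, Col2 and User).
wanted <- long[long$field %in% c("Col2", "User"), ]
print(wanted)
```

If a wide layout is still needed afterwards, the long table is a safer starting point than the list columns, since any padding or pairing rule then has to be chosen explicitly. With tidyr installed, `tidyr::unnest()` on the list columns, as suggested in the comments, may be a more direct route.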

0 Answers