Converting a structured text file with a non-standard format to a data frame in R

Question

I am new to R and am trying to learn basic data I/O and preprocessing. I have a text file in the format given below. It is a non-standard format (unlike CSV, JSON, etc.). I need to convert the following structure into a table-like format (more precisely, a data frame like the one we obtain from CSV files).

Input

product/productId: B000H13270
review/userId: A3J6I70Z9Q0HRX
review/profileName: Lindey H. Magee
review/helpfulness: 1/3
review/score: 5.0
review/time: 1261785600
review/summary: it's fabulous, but *not* from amazon!
review/text: the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!

product/productId: B000H13270
review/userId: A1YLOZQKBX3J1S
review/profileName: R. Lee Dailey "Lee_Dailey"
review/helpfulness: 1/4
review/score: 3.0
review/time: 1221177600
review/summary: too expensive
review/text: howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee

Output

product/productId | review/userId  | ... | review/text
B000H13270        | A3J6I70Z9Q0HRX | ... | the price on this .... dissapointing!
B000H13270        | A1YLOZQKBX3J1S | ... | howdy y'all,<br /> ..... lee

In Python I could have done the same in the following manner:

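from collections import defaultdict

revDict = defaultdict(list)  # field name -> list of values, one per review
with open('filename') as dataFile:
    for item in dataFile.read().split('\n'):
        if not item.strip():
            continue  # skip the blank lines between data chunks
        key, value = item.split(': ', 1)  # split on the first ': ' only
        revDict[key].append(value)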

How can something similar be achieved in R? Are there any equivalents in R?

score 1 · Answer 1 · answered Sep 24 '15 at 03:04

There are a lot of ways of doing this. Here's how I would do it, using readLines, tidyr and dplyr:

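library(dplyr)
library(tidyr)
# readLines() opens and closes the file itself when given a path
z <- readLines("mytxt.txt")
z <- data.frame(z = z, stringsAsFactors = FALSE) %>%
         # split on the first ": " only, so later colons stay in the value
         separate(z, into = c("datatype", "val"), sep = ": ", extra = "merge", fill = "right") %>%
         # a new record starts at every product/productId line
         mutate(rep = cumsum(datatype == "product/productId")) %>%
         # drop the blank separator lines, whose val is NA
         na.omit() %>%
         spread(datatype, val)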

You'll get an output in a dataframe like:

  rep product/productId review/helpfulness         review/profileName review/score
1   1        B000H13270                1/3            Lindey H. Magee          5.0
2   2        B000H13270                1/4 R. Lee Dailey "Lee_Dailey"          3.0
                         review/summary
1 it's fabulous, but *not* from amazon!
2                         too expensive
                                                                                                                                                                                                                                                                                                                                                                                                      review/text
1                                                                                                                                                                                                                               the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
2 howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
  review/time  review/userId
1  1261785600 A3J6I70Z9Q0HRX
2  1221177600 A1YLOZQKBX3J1S
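
The cumsum() step is the part that tends to look cryptic, so here is a minimal base R sketch of it on its own. The sample lines are shortened from the question (three fields per record instead of eight), and the names `lines`, `key` and `record` are only illustrative.

```r
# Shortened sample input (assumed: three fields per record instead of eight)
lines <- c(
  "product/productId: B000H13270",
  "review/score: 5.0",
  "review/summary: it's fabulous, but *not* from amazon!",
  "",
  "product/productId: B000H13270",
  "review/score: 3.0",
  "review/summary: too expensive"
)

# Field name = everything before the first ": "
key <- sub(": .*$", "", lines)

# TRUE marks the first line of each record; cumsum() turns those marks
# into a running record number that every following line inherits
record <- cumsum(key == "product/productId")

print(data.frame(key = key, record = record))
```

Every line ends up with the number of the record it belongs to. The blank separator line still carries the previous record's number, which is why the pipeline drops it with na.omit() before reshaping.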

Ritchie Sacramento · Accepted Answer · 2015-09-24T03:26:33.013

Here's a quick and dirty way that splits on colons (every colon except the first on each line is stripped from the text that was read in), then reshapes the data from long to wide:

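# give readLines() the path directly so no connection is left open
mytxt <- readLines("mytext.txt")
# comment.char = "" keeps any "#" in the review text from truncating a line
mytable <- read.table(text = gsub("^([^:]*:)|:", "\\1", mytxt), sep = ":", quote = "", comment.char = "")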
mytable$id <- rep(1:(nrow(mytable)/8), each = 8)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")

Which gives:

  id V2.product/productId V2.review/userId           V2.review/profileName  V2.review/helpfulness V2.review/score V2.review/time                      V2.review/summary                                                                                                                                                                                                                                                                                                                                                                                                    V2.review/text
1  1           B000H13270   A3J6I70Z9Q0HRX                 Lindey H. Magee                   1/3             5.0     1261785600  it's fabulous, but *not* from amazon!                                                                                                                                                                                                                                the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
9  2           B000H13270   A1YLOZQKBX3J1S      R. Lee Dailey "Lee_Dailey"                   1/4             3.0     1221177600                          too expensive  howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee

This assumes that each case consists of exactly 8 lines.
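
To see what the gsub() pattern does before read.table() gets the text, here is a small sketch on two lines. The first line is from the question; the second is a made-up value with extra colons, included only to show the effect.

```r
x <- c(
  "review/helpfulness: 1/3",
  "review/text: note: price at 10:30"   # hypothetical value containing colons
)

# "^([^:]*:)" matches the field name plus its colon and is put back via \\1;
# the alternative ":" matches any later colon, where group 1 is empty,
# so those colons are replaced with nothing
cleaned <- gsub("^([^:]*:)|:", "\\1", x)
print(cleaned)
```

The first line comes back unchanged, while the second loses every colon after the field name. That keeps read.table() from seeing more than two columns, at the cost of altering values that legitimately contain colons (times, URLs and so on).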

The assumption that each case has 8 lines is a valid assumption. — Amrith Krishna, Sep 24 '15 at 03:39

score 1 · Answer 3 · answered Sep 24 '15 at 03:33

Here is a 'poor man's' method.

I assume that all blocks of data have the same fields, that there are no missing fields, and that : is used only as the separator.

You have 8 fields; in the example I use 3 and simplify their names.

fields <- 3

# you can use file="example.txt" instead of text=...
data <- read.table(text="
    prod: foo  1 
    rev1: bar 11
    rev2: bar 12

    prod: foo  2
    rev1: bar 21
    rev2: bar 22
  ", sep=":", strip.white=TRUE, stringsAsFactors=FALSE)

rows <- dim(data)[1]/fields

mdata <- matrix(data$V2, nrow=rows, ncol=fields, byrow=TRUE)

colnames(mdata) <- data$V1[1:fields]

as.data.frame(mdata)

Result:

     prod    rev1    rev2
1  foo  1  bar 11  bar 12
2  foo  2  bar 21  bar 22
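
If the "no missing fields" assumption does not hold, the matrix cannot be filled row by row. One possible variation, sketched here in base R, is to number the records first and then place each value by (record, field) position. The input below is made up, and the sketch still assumes that every record starts with a prod line and that field names contain no colon.

```r
# Hypothetical input: the second record has no rev2 line
lines <- c("prod: foo 1", "rev1: bar 11", "rev2: bar 12",
           "",
           "prod: foo 2", "rev1: bar 21")

lines <- lines[nzchar(lines)]          # drop blank separator lines
key   <- sub(": .*$", "", lines)       # text before the first ": "
val   <- sub("^[^:]*: ", "", lines)    # text after the first ": "
rec   <- cumsum(key == "prod")         # record number for each line

fields <- unique(key)
# Start from an all-NA character matrix and fill only the cells that exist
m <- matrix(NA_character_, nrow = max(rec), ncol = length(fields),
            dimnames = list(NULL, fields))
m[cbind(rec, match(key, fields))] <- val

result <- as.data.frame(m, stringsAsFactors = FALSE)
print(result)
```

The missing rev2 value of the second record stays NA instead of shifting the later values into the wrong columns.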
