I am trying to split the following data into three fields: first name, timestamp, and the text. Currently, all of the data sits in a single column of a data frame; this column is called Text 1. Here is how it looks:

text

First Name:          00:03       Welcome Back text text text
First Name 2:        00:54       Text Text Text
First Name 3:        01:24       Text Text Text

This is what I did so far:

library(stringr)
text$specificname = str_split_fixed(text$text, ":", 2)[, 1]  # keep only the part before the first colon

and it created the following

text                                                            specific name

First Name:          00:03       Welcome Back text text text    First Name
First Name 2:        00:54       Text Text Text                 First Name 2
First Name 3:        01:24       Text Text Text                 First Name 3

How do I do the same for the timestamp and text? Is this the best way of doing it?

EDIT 1: This is how I brought in my data


library(rvest)

#Specifying the url for desired website to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'

#Reading the HTML code from the website
wp = read_html(url)

#selecting all <p> (paragraph) nodes and assigning them to an object
alltext = html_nodes(wp, 'p')

#turn data into text, then dataframe
alltext = html_text(alltext)
text = data.frame(alltext)
  • How did you read the data in? It looks like you might have fixed-width data. There are functions to read that in properly. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. I'm unclear as to what your exact input is. – MrFlick Sep 24 '19 at 15:22
  • edited my original post to address this – Shehroz Malik Sep 24 '19 at 15:41

1 Answer

Assuming that text is in the form shown in the Note at the end, i.e. a character vector with one component per line, we can replace each run of two or more spaces with a comma and then read the result with read.table:

read.table(text = gsub("  +", ",", text), sep = ",", as.is = TRUE)

giving this data.frame:

             V1    V2                          V3
1   First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54              Text Text Text
3 First Name 3: 01:24              Text Text Text

Note

Lines <- "First Name:          00:03       Welcome Back text text text
First Name 2:        00:54       Text Text Text
First Name 3:        01:24       Text Text Text"

text <- readLines(textConnection(Lines))

Update

Regarding the EDIT that was added to the question: define a regular expression pat that matches optional whitespace, 2 digits, a colon, 2 digits and optional trailing whitespace. Then use grep to keep only the lines that match it, giving tt. In each remaining line, replace the first match with @, the captured timestamp (the match without its surrounding whitespace) and @, giving g. Finally, read g in using @ as the field separator, giving DF.

pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "@\\1@", tt)
DF <- read.table(text = g, sep = "@", quote = "", as.is = TRUE)
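
The steps above can be tried without scraping anything. The following is a minimal sketch, assuming a small made-up vector alltext in place of text$alltext; the column names name, time and text and the final colon-stripping step are illustrative additions, not part of the code above.

```r
# Hypothetical sample lines standing in for text$alltext (not the scraped data)
alltext <- c(
  "First Name:          00:03       Welcome Back text text text",
  "This paragraph has no timestamp and is dropped",
  "First Name 2:        00:54       Text Text Text"
)

pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, alltext, value = TRUE)  # keep only lines containing a timestamp
g <- sub(pat, "@\\1@", tt)              # e.g. "First Name:@00:03@Welcome Back text text text"

# Illustrative column names (an assumption, not in the original code)
DF <- read.table(text = g, sep = "@", quote = "", as.is = TRUE,
                 col.names = c("name", "time", "text"))

DF$name <- sub(":$", "", DF$name)       # drop the trailing colon from the speaker name
print(DF)
```

Note that read.table treats # as a comment character by default, so if the transcript text may contain # (or @), passing comment.char = "" or choosing a different separator is worth considering.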
G. Grothendieck
  • Thank you for your comment. Need some clarification, does this mean I have to import my data in a different method using read.table, instead of my current way of using the HTML tabs? – Shehroz Malik Sep 24 '19 at 15:52
  • I understand, thank you. However, I have ~600 vectors. The data is [1] first name1: 00:01 TEXT TEXT TEXT [2] first name2: 00:59 TEXT TEXT TEXT [3] first name3: 04:20 TEXT TEXT TEXT – Shehroz Malik Sep 24 '19 at 16:09
  • Regarding the EDIT added to the question see the Update section of the answer. – G. Grothendieck Sep 24 '19 at 23:14
  • Dude.................................I have been trying to do this since 10AM this morning, and it's 8PM now... TY SO MUCH. Now I need to analyze your code and figure out how you did it. – Shehroz Malik Sep 25 '19 at 00:05
  • what is the significance of the \\1 inside @\\1@? – Shehroz Malik Jul 03 '20 at 22:09
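
On the last question: in the replacement string of sub, \\1 is a backreference to whatever the first parenthesised group in the pattern matched, here the timestamp. A small sketch on a made-up line x (an assumption for illustration) shows the difference it makes:

```r
# Hypothetical single line used only to illustrate the backreference
x <- "First Name:          00:03       Welcome Back"
pat <- "\\s*(\\d\\d:\\d\\d)\\s*"

# \\1 is replaced by the text captured by the first (...) group, "00:03",
# so the timestamp survives while the surrounding whitespace is discarded.
with_ref <- sub(pat, "@\\1@", x)

# Without the backreference the whole match, timestamp included, is lost.
without_ref <- sub(pat, "@@", x)

print(with_ref)     # [1] "First Name:@00:03@Welcome Back"
print(without_ref)  # [1] "First Name:@@Welcome Back"
```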