I have a text data file that I will likely read with readLines. The initial portion of each string contains a lot of gibberish, followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split each string after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.
Here is a similar post on Stack Overflow that shows how to locate the last dot:
R: Find the last dot in a string
However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others (for example, the bullet character in the last row of the example below). Also, in some lines one of the three dots was replaced with a comma.
In addition to gregexpr from the post above, I have tried using gsub, but I cannot figure out a solution.
Here is an example data set and the outcome I hope to achieve:
aa = matrix(c(
'first string of junk... 0.2 0 1',
'next string ........2 0 2',
'%%%... ! 1959 ... 0 3 3',
'year .. 2 .,. 7 6 5',
'this_string is . not fine .•. 4 2 3'),
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors = FALSE)
aa
# desired result
#   C1                         C2  C3 C4
# 1 first string of junk       0.2 0  1
# 2 next string .....          2   0  2
# 3 %%%... ! 1959              0   3  3
# 4 year .. 2                  7   6  5
# 5 this_string is . not fine  4   2  3
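To make the goal concrete, here is a minimal sketch of the kind of approach that might work. It rests on assumptions that may not hold for the real file: the separator is always exactly three characters, each one a dot, a comma, or a bullet (U+2022), and exactly three values follow it. The names x, sep, pattern, junk, vals, nums, and out are only for illustration. The greedy first group is what pushes the match to the last separator in each line, so a single decimal point in the data is not mistaken for it.

```r
# Illustrative input mirroring the example rows; "\u2022" is the bullet character.
x <- c("first string of junk... 0.2 0 1",
       "next string ........2 0 2",
       "%%%... ! 1959 ... 0 3 3",
       "year .. 2 .,. 7 6 5",
       "this_string is . not fine .\u2022. 4 2 3")

# Assumed separator: exactly three characters, each a dot, comma, or bullet.
sep <- "[.,\u2022]{3}"

# The greedy "(.*)" takes as much as it can, so the separator that is
# matched is the LAST run of three separator characters in the line.
pattern <- paste0("^(.*)", sep, "\\s*(.*)$")

junk <- trimws(sub(pattern, "\\1", x))  # everything left of the separator
vals <- trimws(sub(pattern, "\\2", x))  # everything right of the separator

# Split the data part on whitespace; assumes three values per line.
nums <- do.call(rbind, strsplit(vals, "\\s+"))

out <- data.frame(C1 = junk,
                  C2 = as.numeric(nums[, 1]),
                  C3 = as.numeric(nums[, 2]),
                  C4 = as.numeric(nums[, 3]),
                  stringsAsFactors = FALSE)
out
```

Lines with no separator at all are not handled here: sub() returns such a line unchanged, so they would need to be detected first (for instance with grepl on the same pattern) and treated separately.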
I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines do not contain gibberish or three dots, but only data. However, that complication can probably wait for a follow-up post.
Thank you for any advice.