I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the strings after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.

Here is a similar post on Stack Overflow that locates the last dot:

R: Find the last dot in a string

However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots was replaced with a comma.

In addition to gregexpr in the post above I have tried using gsub, but cannot figure out the solution.

Here is an example data set and the outcome I hope to achieve:

aa = matrix(c(
'first string of junk... 0.2 0 1', 
'next string ........2 0 2', 
'%%%... ! 1959 ...  0 3 3',
'year .. 2 .,.  7 6 5',
'this_string   is . not fine .•. 4 2 3'), 
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))

aa <- as.data.frame(aa, stringsAsFactors = FALSE)
aa

# desired result
#                             C1  C2 C3 C4
# 1        first string of junk  0.2  0  1
# 2            next string .....   2  0  2
# 3             %%%... ! 1959      0  3  3
# 4                 year .. 2      7  6  5
# 5 this_string   is . not fine    4  2  3

I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.

Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow up post.

Thank you for any advice.

Mark Miller
  • Can you do a search and replace all commas and big dots into regular dots first? – Feng Mai Jun 20 '12 at 19:55
  • I do not think I can replace the commas with dots because the data contain commas in the bigger numbers: 4,500. I forgot to mention that in my post. Although maybe I could replace the commas with dots and then remove the dots from the data after I eliminate the gibberish. – Mark Miller Jun 20 '12 at 19:58

3 Answers

This does the trick, though it is not especially elegant:

options(stringsAsFactors = FALSE)


# Match greedily up to the last run of three delimiter characters
# (period, comma, or big dot), skip any whitespace after it, and keep
# only what follows (the part in parentheses, referenced as \\1)
nums <- sub("^.*[.,•]{3}\\s*(.*)$", "\\1", aa$C1)

# Use strsplit to break the results apart at whitespace; it returns one
# character vector per string
# Use do.call(rbind, ...) to stack those vectors into a matrix with
# one row per original string
num.mat <- do.call(rbind, strsplit(nums, split = "\\s+"))


# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))

# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
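
The result above keeps the whole original string in the first column. A minimal sketch of one way to get closer to the layout in the question, with the trailing delimiter removed from the junk column and the data columns converted to numbers, might look like this. The helper name `split_last_triple` and its internal variables are made up for illustration, the big dot is written as `\u2022` to sidestep editor encoding problems, and the comma stripping assumes that commas in the data are thousands separators (as in 4,500).

```r
# Hypothetical helper: split each string at its LAST run of three
# delimiter characters (period, comma, or big dot U+2022).
split_last_triple <- function(x) {
  pat <- "^(.*)[.,\u2022]{3}\\s*(.*)$"
  # The greedy (.*) pushes the three-delimiter match as far right as possible.
  junk <- sub(pat, "\\1", x)
  data <- sub(pat, "\\2", x)
  # Assumption: every data part holds the same number of fields.
  fields <- do.call(rbind, strsplit(data, "\\s+"))
  # Assumption: commas inside the data are thousands separators.
  nums <- matrix(as.numeric(gsub(",", "", fields)), nrow = nrow(fields))
  colnames(nums) <- paste0("C", seq_len(ncol(nums)) + 1)
  data.frame(C1 = trimws(junk), nums, stringsAsFactors = FALSE)
}

x <- c("first string of junk... 0.2 0 1",
       "next string ........2 0 2",
       "%%%... ! 1959 ...  0 3 3",
       "year .. 2 .,.  7 6 5",
       "this_string   is . not fine .\u2022. 4 2 3")
res <- split_last_triple(x)
res
```

Lines that contain no three-delimiter run at all are not handled here; `sub()` leaves such strings unchanged, so they would need to be filtered out first, for example with `grepl()`.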
Matt Parker
  • It's worth noting that the 'big dot' gave me trouble when I tried sending this code from Vim - yet when copied from the website, it works fine. So my flow was to edit in Vim, paste to the website, and then paste to my console... that ain't right. – Matt Parker Jun 20 '12 at 20:49
  • It looks like maybe the code is assigning the numbers 4,2,3 (from the last string) to all 5 strings in the data set. – Mark Miller Jun 20 '12 at 21:21
  • @MarkMiller Ah, sorry - I was working with the `aa` matrix, not as a data.frame. If you want to use a data.frame, you can just assign `nums` like this: `as.vector(gsub(aa$C1, pattern = "^.*[.,•]{3}\\s*(.*)", replace = "\\1"))` – Matt Parker Jun 20 '12 at 21:27

This will get you most of the way there, and it will have no problems with numbers that include commas:

# First, use a regex to eliminate the bad pattern.  This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by 
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
  apply(aa, 1, function (x) 
    gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))

# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter, 
# digit, or space, and (b) followed by a digit.  The result is a 
# list, each element of which is a list containing the parts of 
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x) 
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))  

# Remove the second element of aa.list.  There is no space before the 
# first data column in the second string.  As a result, strsplit() split
# it into three parts, not 4.  That in turn throws off the code
# below.
aa.list <- aa.list[-2]

# Make the data frame.
aa.list <- lapply(aa.list, unlist)  # convert list of lists to list of vectors
aa.df   <- data.frame(aa.list)      
aa.df   <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE) 

The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.
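
One way to avoid dropping the second string, sketched here as an alternative rather than as the method used above, is to replace the last three-delimiter run with a marker first and only then split. The marker `"|"` and the variable names are assumptions chosen for illustration; the marker has to be a character that never occurs in the real data.

```r
x <- c("first string of junk... 0.2 0 1",
       "next string ........2 0 2",
       "year .. 2 .,.  7 6 5")

# Assumption: "|" never appears in the real file.
marker <- "|"

# Replace the LAST run of three delimiters (and any spaces after it)
# with the marker. The leading greedy (.*) makes it the last run.
marked <- sub("^(.*)[.,\u2022]{3}\\s*", paste0("\\1", marker), x)

# Split at the marker: the left part is junk, the right part is data.
parts <- strsplit(marked, marker, fixed = TRUE)
junk  <- vapply(parts, `[`, character(1), 1)
data  <- vapply(parts, `[`, character(1), 2)

# The data part is now well behaved, so a plain whitespace split is enough.
data.cols <- do.call(rbind, strsplit(data, "\\s+"))
```

Because the junk is separated before any splitting on spaces, the missing space in the second string no longer matters.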

user697473
  • If I add a space between the last dot and the 2 in the second string could you modify the aa.list line to handle it? In my real data I think there always was a space after the last dot and I just did not realize it when I created 'aa'. I can also try to figure out how to modify aa.list. – Mark Miller Jun 20 '12 at 21:24
  • Yes, if you add a space between the last dot and the 2 in the second string, the regular expression in the second step could be modified to handle that string. It's a little tricky, but doable. That said, I think that @MattParker has a better idea: start by separating each of your strings into a "bad" part (first column) and a well-behaved part (data columns). Then apply regular expressions to the first column. Then rejoin the two parts. If you do it this way, you can keep the regular expression in `strsplit` pretty simple. Otherwise, the regular expression is going to be more complex. – user697473 Jun 20 '12 at 21:58

Reverse the string
Reverse the pattern you're searching for if necessary - it's not in your case
Reverse the result

[haiku-pseudocode]

a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match 

ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'

// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex

[/haiku-pseudocode]
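
In R, the reverse, search, unreverse idea could be sketched as follows. This is only an illustration of the approach, applied to the three-delimiter pattern from the question rather than to 'junk'. `rev_string` and `split_at_last` are hypothetical helper names, and treating a line with no delimiter as pure data is an assumption based on the question's note about data-only lines.

```r
# Hypothetical helper: reverse the characters of each string.
rev_string <- function(s) {
  vapply(strsplit(s, ""), function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Hypothetical helper: split ONE string at its last run of three delimiters.
# The pattern reads the same in both directions, so it needs no reversal.
split_at_last <- function(a, pattern = "[.,\u2022]{3}") {
  ra  <- rev_string(a)
  # regexpr() finds the FIRST match, which in the reversed string
  # corresponds to the LAST match in the original.
  hit <- regexpr(pattern, ra)
  # Assumption: a line without the delimiter contains only data.
  if (hit == -1) return(c(junk = "", data = a))
  end <- hit + attr(hit, "match.length") - 1
  trimws(c(junk = rev_string(substr(ra, end + 1, nchar(ra))),
           data = rev_string(substr(ra, 1, hit - 1))))
}

split_at_last("first string of junk... 0.2 0 1")
```

Searching from the left of the reversed string plays the same role as the greedy `^.*` prefix used in the other answers.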

KevinDTimm