
I have a huge list of text files (50,000+) that contain normal sentences. Some of these sentences have words that have merged together, because some of the line endings have been joined up without a space. How do I go about unmerging some of these words in R?

The only suggestion I could get was here, and I kind of attempted something from here, but both suggestions require big matrices, which I can't use because I either run out of memory or RStudio crashes :( Can someone help please? Here's an example of a text file I'm using (there are 50,000+ more where this came from):

Mad cow disease, BSE, or bovine spongiform encephalopathy, has cost the country dear.
More than 170,000 cattle in England, Scotland and Wales have contracted BSE since 1988.

More than a million unwanted calves have been slaughtered, and more than two and a quarter million older cattle killed, their remains dumped in case they might be harbouring the infection.

In May, one of the biggest cattle markets, at Banbury in Oxfordshire, closed down. Avictim at least in part, of this bizarre crisis.

The total cost of BSE to the taxpayer is set to top £4 billion.

EDIT: for example: "It had been cushioned by subsidies, living in an unreal world. Many farmers didn't think aboutwhat happened beyond the farm gate, because there were always people willing to buy what they produced."

See the 'aboutwhat' part? Well, that happens in about 1 in every 100 or so articles. Not this actual article; I just made the above up as an example. The words have been joined together somehow (I think some articles lost spaces when I read them in, or my notepad reader joins the end of one line onto the start of the next).

EDIT 2: here's the error I get when I use a variation of what they have here, replacing the created lists with read-in lists:

Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 627 

I've never seen that error before. It does come up here and here, but there's no solution to it in either place :(

  • Is each sentence on its own line? And what do you mean when you say the words have "merged"? – Justin Mar 17 '14 at 13:53
  • @Justin I'll edit with an example now. – user72423 Mar 17 '14 at 13:59
  • No, each sentence is not on its own line. It depends on the article (normally it's paragraphs). – user72423 Mar 17 '14 at 14:05
  • Is R really the "right" place to solve this? You mention "when I read in some articles", which suggests that you have converted these from some other format; I'd try and solve this problem there. – Sam Mason Mar 17 '14 at 14:20
  • @SamMason Hi Sam, I'm doing it in R because I'm meant to be practising with R, as I don't have much experience with it. I read in the articles as I've been given them, in the form of text files. I would alter them manually, but there are so many and I obviously can't find every word that's been joined up. If it can't be done in R then oh well. – user72423 Mar 17 '14 at 14:27
  • OK, I think you may have picked a "difficult" problem for practice! Quite a few words in English look the same as two words concatenated. For example, 'looking' could be one word, or 'loo' and 'king'. How would the computer distinguish between the two? – Sam Mason Mar 17 '14 at 14:34
  • If the word that you checked is a word, keep it. If not, look to see if that word contains two words that exist in the English dictionary. If yes, split that word at that point. If not, keep it. That's my process so far. Also, the problem is to parse these text files until all I'm left with are the most important words: important as in either not common in the English language, appearing frequently but not stopwords, or just obscure in general. I can only get so far with the `tm` package. – user72423 Mar 17 '14 at 14:47
  • This doesn't seem like the best task for R. Perhaps you can look into using the `aspell` function as a starting point to identify misspelled words.... – A5C1D2H2I1M1N2O1R2T1 Mar 18 '14 at 17:11

1 Answer


Based on your comments, I'd use an environment, which is basically a hashtable in R. Start by building a hash of all known words:

words <- new.env(hash=TRUE)  # an environment behaves like a hashtable: constant-time name lookup
for (w in c("hello","world","this","is","a","test")) words[[tolower(w)]] <- TRUE
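
Looking up a name that was never assigned in an environment with [[ gives back NULL rather than raising an error, which is exactly what the is.null() tests below rely on:

is.null(words[["hello"]])  # FALSE: stored above
is.null(words[["zzzz"]])   # TRUE: never assigned, so the lookup gives NULL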

In practice you'd want to load the contents of /usr/share/dict/words or similar instead of that toy list. As a rough sketch, assuming that file exists on your system and holds one word per line, loading it could look like this:
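
dict <- readLines("/usr/share/dict/words")   # one word per line
for (w in dict) words[[tolower(w)]] <- TRUE

Then we define a function that does what you described: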

dosplit <- function (w) {
  # empty tokens pass straight through; a known word (in any case) is kept as-is
  if(nzchar(w) && is.null(words[[tolower(w)]])) {
    n <- nchar(w)
    # seq_len() gives an empty sequence for single-character words,
    # where 1:(n-1) would wrongly loop over c(1, 0)
    for (i in seq_len(n-1)) {
      a <- substr(w,1,i)
      b <- substr(w,i+1,n)
      # accept the first (leftmost) split whose halves are both known words
      if(!is.null(words[[tolower(a)]]) && !is.null(words[[tolower(b)]]))
        return (c(a,b))
    }
  }
  w
}

Then we can test it:

test <- 'hello world, this isa test'
# split on whitespace and try dosplit() on every token
ll <- lapply(strsplit(test,'[ \t]')[[1]], dosplit)

and if you want it back as a single space-separated string:

paste(unlist(ll, use.names=FALSE), collapse=' ')

which for the test string above gives "hello world, this is a test" (note that "world," is left alone, since the trailing comma means no split produces two dictionary words).

Note that this is going to be slow for large amounts of text; R isn't really built for this sort of thing. I'd personally use Python for this sort of task, and a compiled language if it got much larger.
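
If you do want to run this over the whole collection of files, a sketch along the following lines might work. The "articles" and "fixed" directory names are made up for illustration, so adjust the paths to your own layout:

infiles <- list.files("articles", pattern="\\.txt$", full.names=TRUE)
dir.create("fixed", showWarnings=FALSE)
for (f in infiles) {
  lines <- readLines(f)
  # rebuild each line from its (possibly split) tokens
  fixed <- vapply(lines, function(line) {
    tokens <- strsplit(line, "[ \t]")[[1]]
    paste(unlist(lapply(tokens, dosplit), use.names=FALSE), collapse=" ")
  }, character(1), USE.NAMES=FALSE)
  writeLines(fixed, file.path("fixed", basename(f)))
}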

Sam Mason
  • Sam, I've been working on this for about 2 hours without much luck, so your effort is extremely appreciated. I'm going to test what you've put and see what happens. I suspect you're right about R, but at least I'm learning what it is and isn't good for as I progress. – user72423 Mar 17 '14 at 17:24