I have a large collection of text files (50,000+) that contain normal sentences. In some of them, words have merged together because line endings were dropped and the end of one line was joined to the start of the next. How do I go about unmerging these words in R?
The only suggestion I could find was here, and I attempted something based on here, but both approaches require big matrices, which I can't use because I either run out of memory or RStudio crashes :( Can someone help, please? Here's an example of a text file I'm using (there are 50,000+ more where this came from):
Mad cow disease, BSE, or bovine spongiform encephalopathy, has cost the country dear.
More than 170,000 cattle in England, Scotland and Wales have contracted BSE since 1988.
More than a million unwanted calves have been slaughtered, and more than two and a quarter million older cattle killed, their remains dumped in case they might be harbouring the infection.
In May, one of the biggest cattle markets, at Banbury in Oxfordshire, closed down. Avictim at least in part, of this bizarre crisis.
The total cost of BSE to the taxpayer is set to top £4 billion.
EDIT: for example: "It had been cushioned by subsidies, living in an unreal world. Many farmers didn't think aboutwhat happened beyond the farm gate, because there were always people willing to buy what they produced."
See the 'aboutwhat' part. That happens in roughly 1 in every 100 articles. It is not from this actual article; I just made the above up as an example. The words have been joined together somehow (I think that when I read in some articles, some spaces were lost, or my notepad reader joined the end of one line to the start of the next).
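To make it concrete, here is a minimal sketch of the behaviour I'm after. It is only an illustration under assumptions, not something I have working at scale: `dict` is a tiny made-up word list (a real run would need a full dictionary), and `split_merged` and `fix_line` are hypothetical helper names. It checks each token against the word list and, if the token is unknown, tries every two-way split until both halves are known words. It works one line at a time, so it never builds a big matrix.

```r
# Hypothetical tiny word list; a real run would load a full dictionary instead.
dict <- c("think", "about", "what", "happened", "beyond", "the", "farm", "gate")

# Try to split one token into two dictionary words.
# Returns the token unchanged if it is already a known word or no split works.
split_merged <- function(token, dict) {
  word <- tolower(token)
  n <- nchar(word)
  if (n < 2 || word %in% dict) return(token)
  for (i in seq_len(n - 1)) {
    left  <- substr(word, 1, i)
    right <- substr(word, i + 1, n)
    if (left %in% dict && right %in% dict) {
      # Split the original token so its capitalisation is preserved.
      return(paste(substr(token, 1, i), substr(token, i + 1, n)))
    }
  }
  token
}

# Fix a single line: split on spaces, repair each token, and rejoin.
fix_line <- function(line, dict) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  paste(vapply(tokens, split_merged, character(1), dict = dict), collapse = " ")
}

fix_line("Many farmers didn't think aboutwhat happened beyond the farm gate", dict)
```

Tokens with punctuation attached (such as "gate,") are left untouched by this sketch, and it takes the first split that matches, which may not always be the right one.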
EDIT 2: here's the error I get when I use a variation of what they have here, replacing the created lists with read-in lists:
Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 627
I've never seen that error before. It does come up here and here, but there is no solution to it in either :(