4

I am trying to process a text file that is part of a corpus I would like to analyze. To create a Corpus object with the tm package (a text mining package in R), I need to turn this paragraph into a single long character string (a length-one vector) so that it is read properly.

I have a paragraph

          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.

I've tried both scan and readLines, and each returns the text as one element per line, like this:

[28] " commercial exploitation over the past two hundred years drove "
[29] " the great mysticete whales to near extinction variation in "
[30] " the sizes of populations prior to exploitation minimal "

Is there a way to get rid of the line breaks? Or to read the text file as one gigantic vector?

All of the solutions posted so far have been great, thank you.

Zaynaib Giwa

3 Answers

6

This reads the entire file into a character vector of length one (here `file` is the path to the file):

x <- readChar(file, file.info(file)$size)
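A minimal self-contained sketch of the same call, assuming a throwaway temporary file (the `path` variable and the sample text are illustrative, not from the question). Naming the variable `path` rather than `file` avoids shadowing base R's `file()` function. The last step is optional and only needed if the line breaks should go away entirely.

```r
# Hypothetical sample file standing in for the real corpus file.
path <- tempfile(fileext = ".txt")
writeLines(c("  Commercial exploitation over the past  ",
             "  two hundred years drove  "), path)

# file.info()$size is the size in bytes, an upper bound on the character count.
x <- readChar(path, file.info(path)$size)

length(x)  # a single string that still contains the line breaks

# Optional: replace every run of whitespace (line breaks included) with one space.
one_line <- trimws(gsub("\\s+", " ", x))
```

`x` keeps the original layout, while `one_line` is the flattened version suitable for passing on as a single document.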

Jim
  • This solution sounds really good. However, how can you write the same output into the file ? I used `write` command and it would have empty line after each line that has text. – Jd Baba Nov 18 '17 at 01:08
  • @JaneshDevkota to write a character vector into a file try using `cat` e.g. `cat(charVector, file = "textfile.txt", append = F, fill = F)`. When append is false it will overwrite the file. If fill is false no new lines or carriage returns will be added (including EOL & EOF) which may be an issue for some programs. But all the control is in your hands – Chris Njuguna May 01 '18 at 18:24
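The `cat` suggestion from the comment above can be sketched as a round trip. The file name is a temporary, hypothetical one, and `sep = ""` is an addition here so that nothing is inserted if the vector has more than one element. On platforms that translate line endings when writing text, the final comparison may differ.

```r
# Hypothetical input: a length-one character vector with embedded line breaks.
charVector <- "first line\nsecond line\n"

out <- tempfile(fileext = ".txt")

# With sep = "" and fill = FALSE, cat writes the string as-is and appends nothing.
cat(charVector, file = out, sep = "", append = FALSE, fill = FALSE)

# Read the file back to check that nothing was added or dropped.
back <- readChar(out, file.info(out)$size)
identical(back, charVector)
```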
4

If a lot of processing is done while the file is being read, reading may take a long time. Consider reading it in unchanged and making the changes afterwards. The stringi package has a function for exactly this operation, and because it is implemented in compiled code its functions are fast.

So, assuming you've read the file into an object named txt:

library(stringi)
stri_flatten(txt)
# [1] "          Commercial exploitation over the past two hundred years drove                  \n          the great Mysticete whales to near extinction.  Variation in                   \n          the sizes of populations prior to exploitation, minimal                        \n          population size during exploitation and current population                     \n          sizes permit analyses of the effects of differing levels of                    \n          exploitation on species with different biogeographical                         \n          distributions and life-history characteristics."

The string keeps its original formatting; it is only flattened into one element. To check, we can print it with cat:

cat(stri_flatten(txt))
          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.
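One thing to watch for: `stri_flatten()` joins elements with an empty string by default, so if `txt` came from `readLines()` (one element per line, newlines already removed), a `collapse` argument is needed to keep the words apart. A minimal sketch, assuming stringi is installed; the sample `txt` vector is illustrative.

```r
library(stringi)

# Illustrative input: one element per line, as readLines() would return it.
txt <- c("   Commercial exploitation over the past   ",
         "   two hundred years drove   ")

# Trim the padding on each line, then join the lines with single spaces.
flat <- stri_flatten(stri_trim_both(txt), collapse = " ")

length(flat)
```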
Rich Scriven
  • Thanks Richard! I didn't know that there was a stringi package. – Zaynaib Giwa Dec 07 '14 at 20:25
  • @user3426338 - I would check it out. It takes a minute to learn the functions as there are quite a few, they are all blazing fast. – Rich Scriven Dec 08 '14 at 22:07
  • Thanks. I decided just to do it with the linux command line. I have about 5,700 files to preprocess and it was just the easiest way [Link](http://unix.stackexchange.com/questions/171994/how-to-get-portion-of-lines-from-all-txt-files-in-a-directory/172004?noredirect=1#comment284275_172004) But this is good knowledge for the future. – Zaynaib Giwa Dec 09 '14 at 00:29
3

I had the same problem a while ago and found a workaround: read the individual lines (readLines drops the "\n" newlines) and then paste them together, separated by spaces:

filename <- "tmp.txt"
paste0(readLines(filename),collapse=" ")

If you need to keep the newlines, you can instead read the file as a single character string:

readChar(filename,1e5)

specifying a sufficiently large number of characters (100000 in this case).
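For text padded with spaces, as in the question, pasting the raw lines leaves long runs of blanks between them. A small variation on the first approach, sketched here with a hypothetical temporary file and the illustrative name `txt_lines`, trims each line before joining:

```r
# Hypothetical sample file with padded lines, similar to the question's text.
filename <- tempfile(fileext = ".txt")
writeLines(c("     the great Mysticete whales to near extinction.     ",
             "     the sizes of populations prior to exploitation,     "),
           filename)

txt_lines <- readLines(filename)   # one element per line, newlines removed

# trimws() strips leading and trailing whitespace from each element.
one_string <- paste(trimws(txt_lines), collapse = " ")
```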

renato vitolo