Short version:

I have a large number of .txt files with unwanted characters, such as â, dotted about everywhere after using regex to remove URLs and whitespace. I need to remove these characters from all the files.

These â characters were not present before cleaning the files; they are produced by the cleaning itself.

Long version:

I found a regex that works for my text, and the URLs are being removed. First of all, my cleaning process (the commented out lines are other things I have tried):

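library(magrittr)  # provides %>%
library(tm)        # provides stripWhitespace()

clean_file <- sapply(curr_file, function(x) {
    gsub("&amp;", "&", x) %>%                          # decode HTML ampersands
        gsub("http\\S+\\s*", "", .) %>%                # remove URLs
        gsub("[^[:alpha:][:space:]&']", "", .) %>%     # keep letters, whitespace, & and '
        #gsub("[^[:alnum:][:space:]\\'-]", "", .) %>%
        stripWhitespace() %>%                          # collapse runs of whitespace
        gsub("^ ", "", .) %>%                          # trim a leading space
        gsub(" $", "", .)                              # trim a trailing space
        #gsub("â", "", .)
})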

Example input text (each line is a character string):

Gluskin’s Rosenberg: Don’t Bet on a Bear Market for Treasurys -  Rising Treasury yields?... http://j.mp/UVM31t   #FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market: Large investment asset losses can be… http://goo.gl/fb/cgzGv 
Thank You http://pages.townhall.com/campaign/will-2013-be-a-bull-or-bear-market …  via @townhallcom
Calif. GHG cap-and-trade: a bull or a bear market? http://bit.ly/VG9DTr 

Unfortunately it doesn't appear here, but the text above also contains some non-standard characters, namely \302. R reports them as exactly that:

> x = _                                   <-- appears as an underscore in my text editor
Error: object '\302' not found

It may be that they come from shift+space, as hinted here. However, they are an artefact of my data, so I need to remove them because I cannot prevent them.

Output produced (visible in saved .txt file):

Gluskinâs Rosenberg Donât Bet on a Bear Market for Treasurys - Rising Treasury yields FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can beâ
Thank You â via townhallcom
Calif GHG cap-and-trade a bull or a bear market

Output as visible in R console:

> head(clean_file)
      ..text                                                                                                        
[1,] "Nice bear market rally for the Lakers NBA"                                                                    
[2,] "Commented on StockTwits your scenario is entirely possible and as long as SPX doesn't exceed the bear market" 
[3,] "Gluskin\342s Rosenberg Don\342t Bet on a Bear Market for Treasurys Rising Treasury yields FederalReserve"           
[4,] "Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can be\342"
[5,] "Thank You \342 via townhallcom"
[6,] "Calif GHG capandtrade a bull or a bear market"
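
The \342 in the console output is consistent with a byte-level explanation, sketched here as an assumption rather than a confirmed diagnosis: the curly apostrophe and the ellipsis in the input are multi-byte UTF-8 characters that share the lead byte 0xE2, and a cleaning step that works byte by byte could drop the trailing bytes while keeping the lead byte, which reads as â in Latin-1. The variable names below are illustrative only.

```r
# U+2019 (curly apostrophe) and U+2026 (ellipsis) encoded as UTF-8 bytes
apostrophe <- charToRaw(enc2utf8("\u2019"))  # bytes e2 80 99
ellipsis   <- charToRaw(enc2utf8("\u2026"))  # bytes e2 80 a6
print(apostrophe)
print(ellipsis)

# The shared lead byte 0xE2, written in octal, is the "\342" seen in the console
lead_octal <- format(as.octmode(as.integer(apostrophe[1])))
print(lead_octal)

# Interpreted on its own as Latin-1, that byte is a-circumflex (U+00E2)
lead_as_latin1 <- iconv(rawToChar(apostrophe[1]), from = "latin1", to = "UTF-8")
print(lead_as_latin1)
```

If this is what is happening, the stray â is not a new character but the leftover first byte of each ’ or … in the original text.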

Before I thought of this as an encoding issue, I tried simply replacing the â characters, which failed:

gsub("â", "", myText)

I have tried a few other solutions for changing the encoding of the file (found in the solutions here). I also tried writing to file while forcing the output encoding with fileEncoding = 'ascii' instead of the default (UTF-8, I believe), but ascii simply gave me warnings and truncated many lines, leaving some completely empty. There also didn't seem to be any correlation between the affected lines and where the â character had previously appeared.

Can I try to prevent these characters from being created when writing in the future?

n1k31t4

1 Answer

This keeps only characters in the range hex 00 to hex 7f (the ASCII range), where Lines is a character vector whose components are the lines of your file:

gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
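
As a minimal sketch of how this behaves, the substitution can be tried on lines resembling the question's sample input. The vector contents and the names ascii_only and kept_quotes are illustrative, and the optional quote-mapping step is an addition not found in the answer above.

```r
# Sample lines modelled on the question; \u2019 is the curly apostrophe,
# \u2026 the ellipsis
Lines <- c(
  "Gluskin\u2019s Rosenberg: Don\u2019t Bet on a Bear Market for Treasurys",
  "Large investment asset losses can be\u2026",
  "Calif. GHG cap-and-trade: a bull or a bear market?"
)

# Drop every character outside hex 00-7f
ascii_only <- gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
print(ascii_only)

# Optional extra step (not in the answer): map the curly apostrophe to a plain
# one first, so that "Don't" keeps its apostrophe
straightened <- gsub("\u2019", "'", Lines, fixed = TRUE)
kept_quotes <- gsub("[^\\x{00}-\\x{7f}]", "", straightened, perl = TRUE)
print(kept_quotes)
```

Running this before the other cleaning steps removes the multi-byte characters whole, so no stray lead byte should be left behind to show up as â.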
G. Grothendieck
  • For anyone wondering which characters are left in the file, see [this link](http://ascii.cl/). I can't thank you enough, _G. Grothendieck_ :-) – n1k31t4 Nov 06 '15 at 18:29
  • Here is maybe [a better link](http://www.asciitable.com/) to show the meaning of the first 32 ASCII characters, i.e. hex 00-1F – n1k31t4 Nov 07 '15 at 10:29
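
Since the question involves many .txt files, the same substitution could be applied to each file in a folder. This is a sketch under assumptions: the files are taken to be UTF-8 encoded, the helper name strip_non_ascii_file and the demo directory are hypothetical, and each file is overwritten in place.

```r
# Hypothetical helper: read a file as UTF-8, strip non-ASCII characters,
# and write the result back to the same path
strip_non_ascii_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  cleaned <- gsub("[^\\x{00}-\\x{7f}]", "", lines, perl = TRUE)
  writeLines(cleaned, path, useBytes = TRUE)
  invisible(cleaned)
}

# Demonstration on a throwaway directory; point this at the real folder instead
demo_dir <- file.path(tempdir(), "tweets_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "sample.txt")
writeLines(enc2utf8("Don\u2019t Bet on a Bear Market\u2026"), demo_file, useBytes = TRUE)

# Process every .txt file found in the directory
txt_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
invisible(lapply(txt_files, strip_non_ascii_file))
print(readLines(demo_file))
```

Because the output then contains only ASCII, it should look the same whichever encoding is used to read it later, which may also help with the concern about characters reappearing on write.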