-1

I import data from an Excel file into R. This file contains some strange space that neither R nor Excel detects it. What I mean by "strange space" is some form of double space but when I do Ctrl+F in Excel and search for double space using space bar in the key board, Excel can't detect it. I found out more about this strange space from this article:

https://www.auditexcel.co.za/blog/excel-not-recognizing-spaces/

Then the problem for me is that how can I replace this double space with proper single space in R?

I tried the following but not working:

table$var1 <- gsub(" +"," ",table$var1)
table$var1 <- gsub("  "," ",table$var1)
table$var1 <- sub("  "," ",table$var1)

Which I guess is predictable because it is probably not double space, it just looks like one.

Anyone come across anything like this before? I wish I can just manually fix the Excel file like that the author says in the URL above but I have a couple of hundreds of Excel files that I need to fix in this case :(

aprscoder
  • 11
  • 2
  • Is your data otherwise just ASCII or does it include special characters or other languages? – G5W Sep 01 '19 at 01:09
  • Try calling `utf8ToInt` on one of these strings and see what numbers you get back. Each one represents one character. I suspect you will get a `9` for the "strange space", which represents the TAB character. – walnut Sep 01 '19 at 01:14
  • @G5W That column does include some special character, but more of common ones that I saw like hyphens. However, the data is English, so is just whatever charcter you can find on keyboards, does it make sense? – aprscoder Sep 01 '19 at 01:37
  • 1
    Can you add some data so that we can test it? Add `dput(table)`. – Ronak Shah Sep 01 '19 at 02:29
  • @RonakShah Sorry I am a bit new. How can I add table to this question? – aprscoder Sep 01 '19 at 04:49
  • @uneven_mark Thanks for your comment. I think that is highly possible that the space is a TAB character. What do you mean by calling out utf8ToInt ? – aprscoder Sep 01 '19 at 04:51
  • If your dataframe is called `table` use `dput(table)` and update your post. – Ronak Shah Sep 01 '19 at 04:51
  • @uneven_mark when i do this utf8ToInt(table$var1[11]) [1] 76 79 78 71 160 160 74 69 84 84 89 – aprscoder Sep 01 '19 at 05:52
  • @uneven_mark do you know what is 160?? That one is the double space that I am seeing – aprscoder Sep 01 '19 at 05:53
  • @aprscoder That is the unicode unit [`U+00A0 NO BREAK SPACE`](https://en.wikipedia.org/wiki/Non-breaking_space), which is considered neither a space not whitespace and has other variations as well, so it is a bit harder to filter. See duplicate above. – walnut Sep 01 '19 at 07:39

2 Answers2

1

How about using trim whitespace with trimws():

x <- "  Some text. "
x <- trimws(x)
x
[1] "Some text."

then adding a space back in:

x <- paste0(" ", x)
x
[1] " Some text."

In your case:

table$var1  <- trimws(table$var1)
table$var1  <- paste0(" ", table$var1)
dbo
  • 1,174
  • 1
  • 11
  • 19
-1

Guess What?! I found an answer in the end from this post: Where is this whitespace hiding?

@uneven_mark You gave me the right direction. Using the post from the above forum, this is what I did:

trim <- function(x) gsub("^\\s+|\\s+$", "", x)
clean <- function(text) {
  gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

trim(clean(table1$var1))

Thank you so much guys :)

aprscoder
  • 11
  • 2
  • I am a bit surprised that this solved your problem. Your example from the comments does not get its non-breaking spaces replaced by this: `utf8ToInt(trim(clean(intToUtf8(c(76,79,78,71,160,160,74,69,84,84,89)))))` outputs `[1] 76 79 78 71 160 160 74 69 84 84 89` – walnut Sep 01 '19 at 07:37