how to remove "strange space" in R data frame?

Question

I import data from an Excel file into R. This file contains some strange space that neither R nor Excel detects it. What I mean by "strange space" is some form of double space but when I do Ctrl+F in Excel and search for double space using space bar in the key board, Excel can't detect it. I found out more about this strange space from this article:

https://www.auditexcel.co.za/blog/excel-not-recognizing-spaces/

Then the problem for me is that how can I replace this double space with proper single space in R?

I tried the following but not working:

table$var1 <- gsub(" +"," ",table$var1)
table$var1 <- gsub("  "," ",table$var1)
table$var1 <- sub("  "," ",table$var1)

Which I guess is predictable because it is probably not double space, it just looks like one.

Anyone come across anything like this before? I wish I can just manually fix the Excel file like that the author says in the URL above but I have a couple of hundreds of Excel files that I need to fix in this case :(

Is your data otherwise just ASCII or does it include special characters or other languages? — G5W, Sep 01 '19 at 01:09
Try calling `utf8ToInt` on one of these strings and see what numbers you get back. Each one represents one character. I suspect you will get a `9` for the "strange space", which represents the TAB character. — walnut, Sep 01 '19 at 01:14
@G5W That column does include some special character, but more of common ones that I saw like hyphens. However, the data is English, so is just whatever charcter you can find on keyboards, does it make sense? — aprscoder, Sep 01 '19 at 01:37
Can you add some data so that we can test it? Add `dput(table)`. — Ronak Shah, Sep 01 '19 at 02:29
@RonakShah Sorry I am a bit new. How can I add table to this question? — aprscoder, Sep 01 '19 at 04:49
@uneven_mark Thanks for your comment. I think that is highly possible that the space is a TAB character. What do you mean by calling out utf8ToInt ? — aprscoder, Sep 01 '19 at 04:51
If your dataframe is called `table` use `dput(table)` and update your post. — Ronak Shah, Sep 01 '19 at 04:51
@uneven_mark when i do this utf8ToInt(table$var1[11]) [1] 76 79 78 71 160 160 74 69 84 84 89 — aprscoder, Sep 01 '19 at 05:52
@uneven_mark do you know what is 160?? That one is the double space that I am seeing — aprscoder, Sep 01 '19 at 05:53
@aprscoder That is the unicode unit [`U+00A0 NO BREAK SPACE`](https://en.wikipedia.org/wiki/Non-breaking_space), which is considered neither a space not whitespace and has other variations as well, so it is a bit harder to filter. See duplicate above. — walnut, Sep 01 '19 at 07:39

dbo · Answer 1 · 2019-09-01T06:01:55.300

1

How about using trim whitespace with trimws():

x <- "  Some text. "
x <- trimws(x)
x
[1] "Some text."

then adding a space back in:

x <- paste0(" ", x)
x
[1] " Some text."

In your case:

table$var1  <- trimws(table$var1)
table$var1  <- paste0(" ", table$var1)

edited Sep 01 '19 at 06:01

answered Sep 01 '19 at 01:17

dbo

1,174
1
11
19

Thanks dbo. Trimming the white space doesn't work. I just tried it – aprscoder Sep 01 '19 at 01:35

aprscoder · Answer 2 · 2019-09-01T06:18:35.443

-1

Guess What?! I found an answer in the end from this post: Where is this whitespace hiding?

@uneven_mark You gave me the right direction. Using the post from the above forum, this is what I did:

trim <- function(x) gsub("^\\s+|\\s+$", "", x)
clean <- function(text) {
  gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

trim(clean(table1$var1))

Thank you so much guys :)

edited Sep 01 '19 at 06:18

answered Sep 01 '19 at 06:09

aprscoder

11
2

I am a bit surprised that this solved your problem. Your example from the comments does not get its non-breaking spaces replaced by this: `utf8ToInt(trim(clean(intToUtf8(c(76,79,78,71,160,160,74,69,84,84,89)))))` outputs `[1] 76 79 78 71 160 160 74 69 84 84 89` – walnut Sep 01 '19 at 07:37

how to remove "strange space" in R data frame?

2 Answers2