Tidy dataframes with special characters

Question

My dataset in csv format (available here) is as follows:

I tried to import the data into R with the following document:

library(tidyverse)
data <- read_csv("<Path to file>\\Sample.csv")

When I tried to extract column with data$, I get the error:

Error in gsub(reStrip, "", completions, perl = TRUE) : 
  input string 4 is invalid UTF-8

Column D of my data contains both numbers and special characters * and the Pilcrow symbol. The number in the last last row for this column is 66.1 but is followed by #. Similarly, column I contains * and the Pilcrow symbol.

I want to clean the data so that for Column D and I, rows with * and the Pilcrow symbol are replaced with NA. In addition, "66.1 #" is converted to "66.1".

My desired table is as below:

I would like to know how I could import the csv file into R and clean it to the desired format in R. Preferable with tidyverse.

My guess is this isn't anything to do with any tidyverse packages, but the fact that you're using non-standard (arguably invalid) column names. Fix those first: remove spaces and special characters. Lots of SO posts cover that already, including [this one](https://stackoverflow.com/q/10688137/5325862). It's really preferable that you create a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) within the question, rather than us having to download external data — camille, Dec 25 '19 at 15:56
The data are downloaded from external database, I INTENTIONALLY retain the column names as they are. I have many datasets. It would be more efficient if I could work with them in R directly without having to rename columns for each dataset. As I mentioned, I prefer solutions from tidyverse, but any other solution is fine. With the data provided, anyone could replicate my error. Although the data is deposited elsewhere, the error is undoubtedly reproducible. In addition, the dataset provided is a minimum dataset for replication, which is in line with the spirit of a reproducible example. Thx — Patrick, Dec 25 '19 at 17:21
The CSV at your link `https://raw.githubusercontent.com/Patricklv/Dataframe-with-special-characters/master/Sample.csv` does not seem to have the Pilcrow symbol -- I see question marks there. Does it look different for you? — Jon Spring, Dec 25 '19 at 19:16
Thank you for the notice. The csv file has been updated. This new version contains the Pilcrow symbol. Thank you. — Patrick, Dec 26 '19 at 14:27
The pilcrow in the values isn't a problem for `readr::read_csv(path, col_names = F)`. You can even read in just the column names as text with no problem. But no version of naming I've tried so far will let those names be set on the resulting dataframe (`setNames`, `purrr::set_names`, `magrittr::set_colnames`). — Brian, Dec 26 '19 at 19:29
I even tried using `dplyr::rename` on each column individually with no luck. — Brian, Dec 26 '19 at 19:35
My comment about it not being tidyverse specific was meant as the reason why I removed the [tag:tidyverse] tag, since the questions doesn't match the guidance given for when to use that tag. Also, didn't mean to change the column names in the files, since that would be bad form anyway. You said the problem arises when you try `data$`, so what I meant was to fix the column names before doing anything like that — camille, Dec 27 '19 at 16:11
Thank you for your suggestions and help. I will pay more attention to the use of the tage for my future post. Thanks again. — Patrick, Dec 28 '19 at 04:18

Chris Ruehlemann · Accepted Answer · 2019-12-27T16:26:20.300

Here's a simple solution.

Read-in the file as a .txt file:

Edit 1: As the data contains the hashtag # in some cells, you need to include the argument comment.char="" so that R will display the data following it (it will however still treat it as the onset of a comment):

df <- read.table("[Your path]/Sample.txt", 
                   header = T, fill = T, quote="", sep="\t", comment.char="")

As others, I find your column headers very unpractical and have taken the liberty to change them to the first eleven letters of the alphabet (merely for visual reasons; you will have reasons to retain your headers):

colnames(df) <- LETTERS[1:11]
df
                                    A     B    C       D     E     F           G       H     I      J    K
1       Adams County, Nebraska(6,10)  31001 ***    62.1  51.4  74.6          25  stable  -5.2  -32.3  32.8
2   Adams County, North Dakota(6,10)  38001 ***       *      *    *  3 or fewer       *     *      *     *
3 Aiken County, South Carolina(6,10)  45003 ***    55.9  51.5  60.6         124  stable  -2.3   -8.8   4.6
4        Aitkin County, Minnesota(6)  27001  ***      ¶     ¶     ¶           ¶       ¶     ¶      ¶     ¶
5   Albemarle County, Virginia(6,10)  51003 ***    49.4  44.1  55.3          64  stable  -2.9  -18.1  15.2
6      Alcona County, Michigan(6,10)  26001 ***  66.1 #  51.2  86.7          17  stable  -3.6  -26.5  26.3

The desired transformations can be done in many ways; one simple way is this (transforms the two columns separately):

df[,4] <- gsub("\\*|¶", NA, df[,4])
df[,9] <- gsub("\\*|¶", NA, df[,9])

A slightly more complex but more economical way is this (transforms the two columns in one go):

df[,c(4,9)] <- lapply(df[,c(4,9)], function(x) gsub("\\*|¶", NA, x))

Edit 2: If you want to get rid of the hashtag in the data, this will do it for column D:

df[,4] <- gsub(" #", "", df[,4])

If you want to get rid of # in a range of columns, subset dfon that range and use lapply:

df[,c([range])] <- lapply(df[,c([range])], function(x) gsub(" #", "", x))

The output is this:

df
                                    A     B    C     D     E     F           G       H     I      J    K
1       Adams County, Nebraska(6,10)  31001 ***  62.1  51.4  74.6          25  stable  -5.2  -32.3  32.8
2   Adams County, North Dakota(6,10)  38001 ***   <NA>     *    *  3 or fewer       *   <NA>     *     *
3 Aiken County, South Carolina(6,10)  45003 ***  55.9  51.5  60.6         124  stable  -2.3   -8.8   4.6
4        Aitkin County, Minnesota(6)  27001  ***  <NA>    ¶     ¶           ¶       ¶   <NA>     ¶     ¶
5   Albemarle County, Virginia(6,10)  51003 ***  49.4  44.1  55.3          64  stable  -2.9  -18.1  15.2
6      Alcona County, Michigan(6,10)  26001 ***  66.1  51.2  86.7          17  stable  -3.6  -26.5  26.3

Thank you Chris. I am fine with renaming the columns with code in R. Your solution solved the issue of converting * and the pilcrow symbol to NA. But all values are missing in the last row starting from column E. Any further idea? This is important since some rows in the dataset downloaded from database has " #" (with a space before the # symbol) after the number in column D. — Patrick, Dec 27 '19 at 05:08
The data starting from column E is missing because the datum string in column E ends with `#`! That chararacter is understood by R as the onset of a _comment_, which is neither treated as data nor code. If you delete the character in the .txt file, the read-in is complete. — Chris Ruehlemann, Dec 27 '19 at 10:35
If you have many cells with # in the data, there's a better solution: see my **Edit** — Chris Ruehlemann, Dec 27 '19 at 11:07

Tidy dataframes with special characters

1 Answers1