0

I've been trying to read in a pipe delimited file containing 96 variables about some volunteer water quality data. Randomly within the file, there's single and double quotation marks as well as semi-colons, dashes, slashes, and likely other special characters

Name: Jonathan "Joe" Smith; Jerry; Emily; etc.

From the output of several variables (such as IsNewVolunteer), it seems that is having issues reading in the data. IsNewVolunteer should always be Y or N, but numbers are appearing and when I queried those lines it appears that the data is getting shifted. Variables that are clearly not names are in the Firstname and lastname column.

Shifted data columns

The original data format makes it a little difficult to see and troubleshoot, especially due to extra variables. I would find a way to remove them, but the goal of the work with R is to provide code that will be able to run on a dataset that is frequently updated.

Original Dataset

I've tried

read.table("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")

But that produces the following error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 132 did not have 94 elements

However, there's nothing out of the ordinary that I've noticed about line 132. I've had more success with

read.csv("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")

but that still produces offsets and errors as discussed above. Is there something I'm doing incorrectly? Any information would be helpful.

Matt
  • 7,255
  • 2
  • 12
  • 34
MSchmied
  • 49
  • 8
  • Please take a look at [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), to modify your question, with a smaller sample taken from your data (check `?dput()`). Posting images of your data or no data makes it difficult to impossible for us to help you! – massisenergy Mar 26 '20 at 12:51
  • Have you tried specifying the encoding? Something like: `read.csv("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="", encoding = "UTF-8")`. You also have multiple `||||` separators in rows 2,3,4, and 21,22 that are visible in your screenshot. You might need to edit your .csv either in R or manually to strip some of them out. – Matt Mar 26 '20 at 13:29

1 Answers1

0

I think it's one of two issues:

Encoding is either UTF-8 or UTF-16:

Try this...

read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-8)

or this...

read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-16)

Too many separators:

If this doesn't work, right-click on your .csv file and open it in a text editor. You have multiple |||| separators in rows 2,3,4, and 21,22 that are visible in your screenshot. Press CTRL+H to find and replace:

Find: ||||
Replace: |

Save the new file and try to open in R again.

Matt
  • 7,255
  • 2
  • 12
  • 34