I have a large number of CSV data files located in many different subdirectories. The files all have the same name and are differentiated only by the subdirectory name.

I'm trying to find a way to import them all into R in such a way that the subdirectory name for each file populates a column in the resulting data frame.

I have generated a list of the files using list.files(), which I've called tto_refs.

head(tto_refs)
[1] "210119/210115 2021-01-19 16-28-14/REF TTO-210119.D/REPORT01.CSV"
[2] "210122/210115 2021-01-22 14-49-41/REF TTO-210122.D/REPORT01.CSV"
[3] "210127/210127 2021-01-27 09-39-15/REF TTO-210127_1.D/REPORT01.CSV"
[4] "210127/210127 2021-01-27 09-39-15/REF TTO-210127_2.D/REPORT01.CSV"
[5] "210127A/210127 2021-01-28 15-57-40/REF TTO-210127A_1.D/REPORT01.CSV"
[6] "210127A/210127 2021-01-28 15-57-40/REF TTO-210127A_2.D/REPORT01.CSV"

I tried a few different methods to import the data into R, but they all failed with errors about 'embedded nul(s)'.

For example:

tbl <- tto_refs %>% map_df(~read.csv(.))

There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In read.table(file = file, header = header, sep = sep, ... :
  line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, ... :
  line 2 appears to contain embedded nulls

etc.

How can I get this data into R?

Edit: the .csv files are generated by Agilent ChemStation analytical software.

The data looks like this: [screenshot of the raw CSV data]

Phenomniverse

1 Answer

Your files are in the UTF-16 (or UCS-2) character encoding. This means that each character is represented by two bytes. Because the data only contain ASCII characters, the second byte of each character is 0.

Because R is expecting a single-byte-per-character encoding, it thinks the second byte is meant to be a null character, which should not be present in a CSV file.

In addition, the files contain a byte-order mark (BOM) at the start of the first line, which is being converted to garbage. You need a UTF-16 to UTF-8 converter; the conversion should also remove the byte order mark, which is not required in UTF-8.
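
If you want to confirm the encoding from within R before converting anything, you can inspect the first few raw bytes of one of the files. A quick diagnostic sketch (the filename here is a stand-in for any one of your files):

first_bytes <- readBin("REPORT01.CSV", what = "raw", n = 4)
print(first_bytes)
# A UTF-16LE file with a byte order mark starts ff fe; after that, ASCII
# text encoded as UTF-16LE alternates each data byte with a 00 byte.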

Personally, I would do this with the command-line tool iconv. If I were using Windows, I would install it via Cygwin.

# convert every UTF-16 .CSV in the current directory to a UTF-8 copy
for f in *.CSV; do
  iconv -f UTF-16 -t UTF-8 <"$f" >"${f%.CSV}-utf8.csv"
done
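
If you would rather do the conversion from within R (the comments below suggest the asker is on Windows without iconv), here is a rough all-R equivalent of the loop above. This is a sketch, untested against these files; convert_report is just a hypothetical helper name, and it assumes the files are UTF-16 with a byte order mark, which a connection opened with encoding = "UTF-16" should detect and consume:

# Read each file through a UTF-16 connection (consumes the BOM),
# then write the text back out through a UTF-8 connection.
convert_report <- function(path) {
  con_in <- file(path, encoding = "UTF-16")
  txt <- readLines(con_in, warn = FALSE)
  close(con_in)
  out_path <- sub("\\.CSV$", "-utf8.csv", path, ignore.case = TRUE)
  con_out <- file(out_path, open = "w", encoding = "UTF-8")
  writeLines(txt, con_out)
  close(con_out)
}
invisible(lapply(tto_refs, convert_report))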

If you don't like the iconv approach, there are several other tools listed as answers to this question.
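
Alternatively, once the encoding is known, you may not need a separate conversion step at all: read.csv takes a fileEncoding argument, and naming the list of paths gives a natural way to record which file (and therefore which subdirectory) each row came from, which is what the question is ultimately after. Another untested sketch (header = FALSE is a guess based on the data shown in the comments):

library(purrr)

tbl <- map_dfr(
  set_names(tto_refs),                    # the names become the .id column
  ~ read.csv(.x, fileEncoding = "UTF-16", header = FALSE),
  .id = "source_file"
)
tbl$subdir <- dirname(tbl$source_file)    # trim further with basename() if needed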

Tom V
  • Thanks Tom. I opened one of the .csv files with Excel and exported it as a UTF-8 CSV, which I was then able to import into R with read.csv. It does seem to include some funny symbols: > head(test_csv) V1 1 1,7.44819498062134,"BB ",0.0286703947931528,1.66939771175385,0.0219694482263266,"?" I don't want to have to open and export hundreds of files with Excel. Is there some way to convert them en masse, ideally from within R? (I'm using a Windows system, not Linux.) – Phenomniverse Nov 10 '21 at 20:56
  • It probably isn't UTF-8 either. Perhaps you can update the question to say how the files were generated? If it isn't confidential data then maybe you could upload one somewhere? – Tom V Nov 10 '21 at 22:31
  • Original question updated as requested. Not sure where I could upload a data file to though? – Phenomniverse Nov 10 '21 at 23:22
  • Perhaps dropbox or something similar? There are obviously no nuls in the extract above because opening it in excel has removed them and stackoverflow wouldn't allow them anyway. If you can upload the lot and put a link in a comment I can have a go at fixing it. – Tom V Nov 11 '21 at 09:41
  • Thanks Tom. Here's a link to an example of a data file: https://cloudstor.aarnet.edu.au/plus/s/r87murc2aSntarH There are many of these, so I don't want to have to do anything to them one by one, I need a solution (preferably in R) that will let me fix and import them all en masse. Cheers. – Phenomniverse Nov 14 '21 at 22:51
  • The question mark "?" field is legitimate data; it's not being added by Excel. When I save it as UTF-8 with Excel, Excel is adding "" to the front of the first line. This itself doesn't prevent the resaved file from being imported into R, although another problem is that it isn't separating the data into columns based on the commas. In the original files, the nuls are apparently in every line of data, not just where the strange symbol appears in the first line. – Phenomniverse Nov 15 '21 at 00:56
  • I suspected the UTF-16 to UTF-8 conversion would be the solution, and have already tried using the r function iconv, but it hasn't worked. > iconv('REPORT01.CSV',"UTF16","UTF8") [1] "剅偏剔〱⹃卖" > report<-read.csv('REPORT01.CSV',header=FALSE) Warning messages: 1: In read.table(file = file, header = header, sep = sep, quote = quote, : line 1 appears to contain embedded nulls 2: In read.table(file = file, header = header, sep = sep, quote = quote, : line 2 appears to contain embedded nulls ...etc – Phenomniverse Nov 15 '21 at 21:12
  • I don't know why iconv wouldn't work, but I managed to convert the file to a UTF-8 format that was readable by read.csv() using a command line tool mentioned on the page you linked to: C:\> powershell -c "Get-Content mytext.txt | Set-Content -Encoding utf8 mytext_utf8.txt" This still leaves the nonsense character at the start of the first line, but I can get rid of that. Now I just have to work out how to apply this solution to all the target files in their respective sub-folders. – Phenomniverse Nov 15 '21 at 21:37
  • The nonsense character is a byte-order-mark or BOM. Perhaps a bit of googling will help you find how to remove that too. – Tom V Nov 15 '21 at 21:39