"incomplete final line found by readTableHeader" when using read.delim() on a tab-delimited file with Chinese character

Question

I get this "incomplete final line found by readTableHeader" warning when using read.delim() to read a tab-delimited text file. There are Traditional Chinese characters in the header and content, so I am already specifying an alternative encoding, like this:

kg = read.delim("KG_EDB_20150505.csv",fileEncoding="UTF-16LE")

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'KG_EDB_20150505.csv'

I have read other posts with similar issues, e.g.:

'Incomplete final line' warning when trying to read a .csv file into R
In read.table(): incomplete final line found by readTableHeader

But unfortunately the suggested solutions in these posts cannot solve the problem.

A summary of what I have tried:

1. Pressing ENTER at the end of the last line of the text file: same error
2. Trimming the text file to the header + 1 single line of data, then making sure there is a new line (ENTER) between the header line and the content: same error
3. Trimming the text file until only the header is left, then copying and pasting the header onto the next line to act as a fake line of data, and adding a new line (ENTER) after the fake line of data: WORKS! The Chinese is all garbage, but I do not need it anyway.
4. Removing the trailing new line (ENTER) in #3: same error, but 1 line of fake data can be read into the data.frame.
5. Opening in Excel directly: works, but not the workflow I want.

So what gives?

Is there a way I can read in such file?

or

Is there a way to massage the file (preferably in R) and then read it in?

The file is here:

https://dl.dropboxusercontent.com/u/5860015/KG_EDB_20150505.csv

It was from a government webpage here:

http://www1.map.gov.hk/gih3/view/index.jsp
(Map Tools > Data Download > Kindergarten-cum-child Care Centres)

Many thanks in advance!

Update:

By a stroke of luck, I isolated an offending character inside the text file, namely this Chinese character "稚". It may not be the only one, but if I add it to the file in #3, I get the same error again. I do not know what is special about this character, and I do not need any of the Chinese info in the text file anyway.

So now there are more questions:

Is there a way to skip reading this offending character?

or

Is there a way in R to replace this offending character in the file, before reading in the text file?

It's NOT an error. Only a warning. Ignore it or add a final carriage return to your datafile if it annoys you too much. — IRTFM, Jun 09 '15 at 04:53
I added a carriage return to the last line. After reading into R, there are 0 observations of 37 variables. Error? — TerenceLam, Jun 09 '15 at 05:24
@BondedDust, please, really, read my post again. For example, if I have not looked at the file with a text editor, how did I edit the text file to test it? — TerenceLam, Jun 09 '15 at 05:41
All Chinese characters were displayed correctly in the text editor. — TerenceLam, Jun 09 '15 at 05:54
Looks like I misunderstood the situation. The link to the other question made me think this was just an ordinary warning that people always think is an error. There are two tested solutions offered below. I had no difficulty using a Mac. It's not a csv file, although when I ran your code, I get no error. — IRTFM, Jun 09 '15 at 05:55

IRTFM · Answer 1 · 2015-06-09T05:50:53.153

It's full of Chinese characters (every other field in fact).

First line:

"ENGLISH CATEGORY" "中文類別" "ENGLISH NAME" "中文名稱" "ENGLISH ADDRESS" "中文地址" "LONGITUDE" "經度" "LATITUDE" "緯度" "EASTING" "坐標東" "NORTHING" "坐標北" "STUDENTS GENDER" "就讀學生性別" "SESSION" "學校授課時間" "DISTRICT" "分區" "FINANCE TYPE" "資助種類" "SCHOOL LEVEL" "學校類型" "OPENING HOURS" "開放時間" "TELEPHONE" "聯絡電話" "FAX NUMBER" "傳真號碼" "EMAIL ADDRESS" "電郵地址" "WEBSITE" "網頁" "RELIGION" "宗教"

And my editor thinks it is UTF-16 and that it is "Little Endian".
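The editor's verdict can be cross-checked from R by looking at the first bytes of the file, since UTF-16 files commonly start with a byte-order mark (FF FE for little endian, FE FF for big endian). A small sketch only: `guess_bom` is a hypothetical helper, not a base R function, and a file without a BOM simply yields NA.

```r
# Hypothetical helper: guess a Unicode encoding from the byte-order mark (BOM).
# Only the three most common BOMs are checked; UTF-32 is ignored.
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 3L)  # first three bytes, undecoded
  if (length(b) >= 3L && identical(b[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))) {
    return("UTF-8")
  }
  if (length(b) >= 2L && identical(b[1:2], as.raw(c(0xFF, 0xFE)))) {
    return("UTF-16LE")
  }
  if (length(b) >= 2L && identical(b[1:2], as.raw(c(0xFE, 0xFF)))) {
    return("UTF-16BE")
  }
  NA_character_  # no recognised BOM
}
```

Calling `guess_bom("KG_EDB_20150505.csv")` on the downloaded file would then show whether it agrees with the editor.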

Unless you are set up with the right fonts and understand the ins and outs of encodings, it is much easier to use an external editor, especially since you say you do not want the info that is in the Chinese fields. I succeeded with the TextWrangler editor from Bare Bones Software. It's the free version of their more fully featured editor, but it has the capacity to remove non-ASCII characters and save the result as a UTF-8 encoded file.

> inp <- read.table("~/Downloads/KG_EDB_20150505.txt", header=TRUE)
> str(inp)
'data.frame':   1385 obs. of  36 variables:
 $ ENGLISH.CATEGORY: Factor w/ 1 level "Kindergartens": 1 1 1 1 1 1 1 1 1 1 ...
 $ X               : logi  NA NA NA NA NA NA ...
 $ ENGLISH.NAME    : Factor w/ 1368 levels "A-ONE KINDERGARTEN",..: 137 38 835 714 858 551 455 533 1073 396 ...
 $ X.1             : Factor w/ 68 levels "","-()","()",..: 5 3 3 5 3 3 3 3 3 3 ...
 $ ENGLISH.ADDRESS : Factor w/ 562 levels "(INCLUDING 1-STOREY SCHOOL EXTENSION) 23 NAM LONG SHAN ROAD ABERDEEN HONG KONG",..: 448 40 34 316 396 55 326 160 273 483 ...
 $ X.2             : Factor w/ 294 levels "","()","()29",..: 257 1 21 1 1 112 1 59 1 289 ...
 $ LONGITUDE       : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ X.3             : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ LATITUDE        : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ X.4             : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ EASTING         : num  836221 828924 834914 818325 828492 ...
 $ X.5             : num  836221 828924 834914 818325 828492 ...
 $ NORTHING        : num  821002 826433 820623 835893 840814 ...
 $ X.6             : num  821002 826433 820623 835893 840814 ...
 $ STUDENTS.GENDER : Factor w/ 2 levels "CO-ED","GIRLS": 1 1 1 1 1 1 1 1 1 1 ...
 $ X.7             : logi  NA NA NA NA NA NA ...
 snipped.

The fields that had Chinese in the header are all now blank. It's NOT a csv file: there are no commas. If I were doing it again for myself, I'd use stringsAsFactors = FALSE.
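The same clean-up can probably be done without leaving R. What follows is a minimal sketch, assuming the file really is UTF-16LE as the editor reports and that dropping every non-ASCII character is acceptable. The function name `read_utf16_ascii_only` is made up for illustration, and this has not been run against the original file.

```r
# Hypothetical helper: read a UTF-16 tab-delimited file, discarding all
# non-ASCII characters before parsing.
read_utf16_ascii_only <- function(path, encoding = "UTF-16LE") {
  # Read the file as raw bytes so nothing is decoded on the way in.
  raw_bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  # iconv() accepts a list of raw vectors; convert the whole file to UTF-8.
  utf8 <- iconv(list(raw_bytes), from = encoding, to = "UTF-8")
  if (is.na(utf8)) stop("could not decode file as ", encoding)
  # sub = "" drops every byte with no ASCII equivalent (Chinese text, BOM).
  ascii <- iconv(utf8, from = "UTF-8", to = "ASCII", sub = "")
  # Normalise Windows line endings before parsing.
  ascii <- gsub("\r\n", "\n", ascii, fixed = TRUE)
  read.delim(text = ascii, stringsAsFactors = FALSE)
}
```

Usage would be along the lines of `kg <- read_utf16_ascii_only("KG_EDB_20150505.csv")`. Because the bytes are decoded in memory rather than through a file connection, this route does not depend on the `fileEncoding` argument.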

It's also possible to input the file with the correct encoding. This works on the original file with no editing at all:

> inp2 <- read.table("~/Downloads/KG_EDB_20150505.csv", header=TRUE, fileEncoding="UTF-16")
> str(inp2)
'data.frame':   1385 obs. of  36 variables:
 $ ENGLISH.CATEGORY: Factor w/ 1 level "Kindergartens": 1 1 1 1 1 1 1 1 1 1 ...
 $ 中文類別        : Factor w/ 1 level "幼稚園": 1 1 1 1 1 1 1 1 1 1 ...
 $ ENGLISH.NAME    : Factor w/ 1368 levels "A-ONE KINDERGARTEN",..: 137 38 835 714 858 551 455 533 1073 396 ...
 $ 中文名稱        : Factor w/ 1355 levels "","DISCOVERY BAY INTERNATIONAL SCHOOL (A.M.)",..: 1186 507 854 630 64 134 1298 147 520 1256 ...
 $ ENGLISH.ADDRESS : Factor w/ 562 levels "(INCLUDING 1-STOREY SCHOOL EXTENSION) 23 NAM LONG SHAN ROAD ABERDEEN HONG KONG",..: 448 40 34 316 396 55 326 160 273 483 ...
 $ 中文地址        : Factor w/ 554 levels "34 PRICE ROAD HONG KONG",..: 32 395 51 259 173 37 58 28 176 370 ...
 $ LONGITUDE       : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ 經度            : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ LATITUDE        : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ 緯度            : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ EASTING         : num  836221 828924 834914 818325 828492 ...
 $ 坐標東          : num  836221 828924 834914 818325 828492 ...
 $ NORTHING        : num  821002 826433 820623 835893 840814 ...
 $ 坐標北          : num  821002 826433 820623 835893 840814 ...
snipped.
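Whether a given encoding lets R decode the whole file can be checked by counting the lines that come through a connection opened with that encoding. This is a diagnostic sketch only, and `count_decoded_lines` is a hypothetical helper name. With the 1385 observations shown above, roughly 1386 lines (header included) would be expected, assuming no quoted field contains a line break; a much smaller count would suggest that decoding stops early.

```r
# Hypothetical helper: count the lines R can decode from a file
# when it is opened with the given encoding.
count_decoded_lines <- function(path, encoding) {
  con <- file(path, open = "r", encoding = encoding)
  on.exit(close(con))  # always release the connection
  length(readLines(con, warn = FALSE))
}
```

Comparing, say, `count_decoded_lines("KG_EDB_20150505.csv", "UTF-16LE")` across machines would show whether they see the same amount of data.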

I tried your suggested UTF-16 encoding and got this error message: `Error in read.table("KG_EDB_20150505.csv", header = TRUE, fileEncoding = "UTF-16") : more columns than column names`. In addition, this warning appeared twice: `In read.table("KG_EDB_20150505.csv", header = TRUE, fileEncoding = "UTF-16") : incomplete final line found by readTableHeader on 'KG_EDB_20150505.csv'` — TerenceLam, Jun 09 '15 at 05:59
What encoding does your editor report? And is that on the original file or the one you saved? — IRTFM, Jun 09 '15 at 06:00
I am more interested in your locale setting in R. What does Sys.getlocale() report on your end? — TerenceLam, Jun 09 '15 at 06:02
Ordinary US locale: `Sys.getlocale() # [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"`. Now, answer the questions, please. — IRTFM, Jun 09 '15 at 06:07
Have you tried `fileEncoding="UCS-2"`? My suspicion is that either your browser or your text editor has coerced that file into a form different from what it started life as. — IRTFM, Jun 09 '15 at 06:31
I suspected that too. I would need more machines with different OS / browsers combo to test this theory. Tried UCS-2 among others (yeah UTF-16LE too, as shown), same error message, same 0 observations data.frame. — TerenceLam, Jun 09 '15 at 07:12
Oh it is unlikely it was the browser, cos you are using the same file I put in Dropbox. The only difference is that you are using a Mac. I am on Win 7 btw. — TerenceLam, Jun 09 '15 at 07:23