
Could you please help me?

I am trying to load a large TSV file (4 million rows) using 'fread' (enormous speed :)

The problem is that the whole program crashes when it reaches a certain line. The last verbose message is "Bumping column 12 from INT64 to REAL on data row 2220004, field contains '0.54'"

I tried reading only up to that row with the 'nrows' option and it worked fine, but when I then tried to read the remaining rows with 'skip' it immediately threw another error: Unexpected character ("Ам) ending field 5 of line 2220005

After that I tried disabling the header, dropping the 12th column, and specifying the column classes. Nothing worked.

Any ideas how to overcome this issue?

My code:

library(data.table)
movies <- fread('avito_train.tsv', verbose=TRUE, nrows=2220002)
movies2 <- fread('avito_train.tsv', verbose=TRUE, sep="\t", skip=2220004, colClasses=c("integer", "character", "character","character","character", "character","integer","integer","integer","integer","integer","real", "numeric")) 

Oh, in case it changes anything: the text within the TSV file is in a Slavic language.
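
One way to follow up on the verbose message is to print only the few physical lines around the reported row and compare their tab and double-quote counts, without opening the whole file in an editor. The sketch below uses base R only. `peek_lines`, `count_char` and `line_stats` are hypothetical helper names, not part of data.table, and the line numbers in the usage comment assume that data row 2220004 sits on physical line 2220005 because of the header.

```r
# Hypothetical helper: return physical lines `from`..`to` of a text file
# without reading the whole file into memory.
peek_lines <- function(path, from, to, chunk = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  to_skip <- from - 1L
  while (to_skip > 0L) {
    n <- min(to_skip, chunk)
    got <- readLines(con, n = n, warn = FALSE)
    if (length(got) < n) return(character(0))  # file is shorter than `from`
    to_skip <- to_skip - n
  }
  readLines(con, n = to - from + 1L, warn = FALSE, encoding = "UTF-8")
}

# Hypothetical helper: number of occurrences of a literal string per element.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

# Hypothetical helper: field count (tabs + 1) and double-quote count per line.
line_stats <- function(x) {
  data.frame(fields = count_char(x, "\t") + 1L,
             quotes = count_char(x, "\""),
             stringsAsFactors = FALSE)
}

# Usage with the file from the question (not run here):
# x <- peek_lines("avito_train.tsv", 2220003, 2220007)
# line_stats(x)
```

A line whose field count differs from its neighbours, or whose quote count is odd, is a reasonable first suspect for the "Unexpected character" error.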

Matt Dowle
Darius
  • Have you looked at the offending line? – Roland Jun 26 '14 at 07:50
  • I'd consider using a base option (*i.e.* `read.delim`). Four million is definitely doable: just use `as.is=TRUE` and try to specify the number of rows. – Hugh Jun 26 '14 at 07:52
  • Thanks for the responses. I am having a hard time opening that file and finding the exact line (I used MacVim). Still no success. `read.delim` works to some extent, but it takes a huge amount of time. A couple of hours wasn't enough. – Darius Jun 26 '14 at 08:10
  • How is this RStudio related? – Roman Luštrik Jun 26 '14 at 08:24
  • Hi Roman, if you mean that it might be a software problem: I tried in plain R and it crashes in the same way, but without explanations. – Darius Jun 26 '14 at 08:30
  • Any chance you can send me the file? Over email to `maintainer("data.table")` or via an online service. – Matt Dowle Jun 26 '14 at 11:11
  • Hi Matt, sure. Actually, as the zipped file is 800 MB, I can only send a link :) By the way, thank you for developing a very useful package. I was able to open the file with 'emacs' and delete the row. After that it worked like a charm: 4 million rows in less than 3 minutes. Surprisingly, I didn't notice anything suspicious about that row. Also, can you tell me whether it is already possible with 'fread' to specify a particular row that you want to skip? – Darius Jun 26 '14 at 12:25
  • Thanks. Downloading now. No, you can't skip a particular row, but nice idea: filed as [#711](https://github.com/Rdatatable/data.table/issues/711). – Matt Dowle Jun 26 '14 at 15:07
  • @Darius Hey, it appears we are working on the same competition and we're hitting some of the same issues. If we teamed up, maybe we could help each other :) Ref. http://stackoverflow.com/questions/24432291/unknown-error-in-fread-of-large-file?noredirect=1#comment37805982_24432291 – user1477388 Jun 26 '14 at 15:47
  • @MattDowle Wow, you have so many requests for updates to your project; how are you able to handle them? Thanks again for the nice package. – Darius Jun 27 '14 at 07:21
  • @user1477388 Oh nice, I hope this thread helped you solve the issues. Unfortunately I am not the right person to collaborate with, as for me this competition is just for learning to deal with big data and for having some fun, so I cannot invest enough time to compete properly. But maybe in the future? ;) – Darius Jun 27 '14 at 07:25
  • @Darius I was going to give you the same disclaimer, this is my first competition. I was actually hoping to learn from you, but still, perhaps we could learn from each other. I have written sentiment analysis programs and spam filters in the past, but currently, I am having some encoding issues with the Russian text. Haven't done anything with Russian text until now. – user1477388 Jun 27 '14 at 12:24
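
For the workaround discussed in the comments above (deleting the bad row by hand, since fread cannot skip one specific row), a base R sketch is to stream the file into a cleaned copy that omits the offending physical line and then parse the copy. `drop_lines` is a hypothetical helper name. The `read.delim` call follows Hugh's `as.is = TRUE` suggestion, and `quote = ""` is an extra assumption that switches quote handling off entirely.

```r
# Hypothetical helper: copy `src` to `dest`, omitting the physical line
# numbers in `bad`. The file is streamed in chunks, so it never has to
# fit in memory all at once.
drop_lines <- function(src, dest, bad, chunk = 100000L) {
  input <- file(src, open = "r")
  output <- file(dest, open = "w")
  on.exit({ close(input); close(output) })
  offset <- 0
  repeat {
    x <- readLines(input, n = chunk, warn = FALSE)
    if (length(x) == 0L) break
    keep <- !((offset + seq_along(x)) %in% bad)
    writeLines(x[keep], output, useBytes = TRUE)  # write bytes back unchanged
    offset <- offset + length(x)
  }
  invisible(dest)
}

# Usage with the file from the question (not run here):
# clean  <- drop_lines("avito_train.tsv", tempfile(fileext = ".tsv"), bad = 2220005)
# movies <- read.delim(clean, as.is = TRUE, quote = "", encoding = "UTF-8")
```

Note that this removes exactly one physical line. If the bad record spans several physical lines (for example a line break inside a quoted field), all of them would have to be listed in `bad`.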

1 Answer


It works fine for me using the latest version of data.table from GitHub. Two recent changes listed in the README may have solved it:

fread():
* now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting: "fread and a quoted multi-line column value"
* now accepts trailing backslash in quoted fields. Thanks to user2970844 for highlighting: "fread and column with a trailing backslash"
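
To check whether an installed data.table handles the first of these cases, a tiny fixture is usually enough. The sketch below is not from the original answer: the file contents are made up, it assumes data.table is installed, and the row count mentioned in the comment is what to look for rather than guaranteed output.

```r
library(data.table)

# Made-up fixture: a two-column TSV whose first record has a line break
# inside a quoted field, so one logical row spans two physical lines.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tdescription",
             "1\t\"first line",
             "second line\"",
             "2\t\"plain\""), tmp)

physical_lines <- length(readLines(tmp))  # header + 3 physical data lines
DT <- fread(tmp, sep = "\t")

# If the embedded line break is handled, there should be 2 data rows
# rather than 3; an older version may instead stop with an error.
print(nrow(DT))
print(DT)
```

If this small case fails on the installed version, updating data.table is a cheaper first step than editing the 2.9 GB file by hand.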

Here is the output (on my slow netbook with 4GB RAM, which struggles but gets there):

$ file avito_train.tsv 
avito_train.tsv: UTF-8 Unicode text, with very long lines

> DT = fread("Downloads/avito_train.tsv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 2.915 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t'
Found 13 columns
First row with 13 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 3995804
Subtracted 1 for last eol and any trailing empty lines, leaving 3995803 data rows
Type codes (   first 5 rows): 1444441111113
Type codes (+ middle 5 rows): 1444441111113
Type codes (+   last 5 rows): 1444441111113
Type codes: 1444441111113 (after applying colClasses and integer64)
Type codes: 1444441111113 (after applying drop or select (if supplied)
Allocating 13 column slots (13 - 0 dropped)
Read 3995803 rows and 13 (of 13) columns from 2.915 GB file in 00:10:49
  82.590s ( 13%) Memory map (rerun may be quicker)
   2.930s (  0%) sep and header detection
  68.290s ( 11%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   3.550s (  1%) Allocation of 3995803x13 result (xMB) in RAM
 491.590s ( 76%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.080s (  0%) Changing na.strings to NA
 649.030s        Total


> head(DT)
     itemid    category               subcategory                 title
1: 10000010   Транспорт     Автомобили с пробегом     Toyota Sera, 1991
2: 10000025      Услуги         Предложения услуг         Монтаж кровли
3: 10000094 Личные вещи Одежда, обувь, аксессуары      Костюм Steilmann
4: 10000101   Транспорт     Автомобили с пробегом      Ford Focus, 2011
5: 10000132   Транспорт     Запчасти и аксессуары       Турбина 3.0 Bar
6: 10000152   Транспорт     Автомобили с пробегом ВАЗ 2115 Samara, 2005
                                                                                                                                                                                                                                                                                                description
1:                                                                                    Новая оригинальная линзованая оптика на ксеноне (ближний, дальний), новые задние фонари, новые 16-е диски, новая передняя резина, задние с небольшим износом. ^p Срочно! Торг! ^p Актуально, пока висит объявление!!!
2:                                                                                                                                                                                                                                                     Выполняем  монтаж кровли фальцевой ^p Тел:8@@PHONE@@
3:                                                                                                                               Юбка и топ из панбархата. Под топ  трикотажная майка. Vобразный вырез спереди и сзади. На юбке по подолу мягкий волан. Длина приблизительно по колено (+3-4 см). Размер 40
4: Автомобиль в отличном техническом состоянии, все работает, включается, переключается и т.д. Нареканий по подвеске, коробке и двигателю нет. Два комплекта резины зима/лето в отличном состоянии. Продается СРОЧНО в связи с семейными обстоятельствами!!! Возможен ТОРГ при осмотре в разумных пределах.
5:                                                                                                                                                                                                                                   Продам турбину на двигатель V-6 . V-8 и мощнее 16 клапанов и выше.....
6:                                                                                                                                                                                           Автомабиль вхорошем состаянием НЕ ГНЕЛАЯ борт комп музыка званите всё раскажу званите влюбое время 8 @@PHONE@@
                                                                                                                                                                                                                                                                                                                                            attrs
1:        {""Год выпуска"":""1991"", ""Тип кузова"":""Купе"", ""Пробег"":""10 000 - 14 999"", ""Коробка передач"":""Автоматическая"", ""Объем двигателя"":""1.5"", ""Тип двигателя"":""Бензиновый"", ""Марка"":""Toyota"", ""Модель"":""Sera"", ""Цвет"":""Оранжевый"", ""Привод"":""Передний"", ""Руль"":""Правый"", ""Состояние"":""Не битый""}
2:                                                                                                                                                                                                                                                                                                     {""Вид услуги"":""Ремонт, строительство""}
3:                                                                                                                                                                                                                                            {""Вид одежды"":""Женская одежда"", ""Предмет одежды"":""Платья и юбки"", ""Размер"":""46–48 (L)""}
4:              {""Марка"":""Ford"", ""Модель"":""Focus"", ""Год выпуска"":""2011"", ""Пробег"":""80 000 - 84 999"", ""Тип кузова"":""Седан"", ""Цвет"":""Чёрный"", ""Объём двигателя"":""1.6"", ""Коробка передач"":""Механическая"", ""Тип двигателя"":""Бензиновый"", ""Привод"":""Передний"", ""Руль"":""Левый"", ""Состояние"":""Не битый""}
5:                                                                                                                                                                                                                                                                              {""Вид товара"":""Запчасти"", ""Тип товара"":""Для автомобилей""}
6: {""Марка"":""ВАЗ (LADA)"", ""Модель"":""2115 Samara"", ""Год выпуска"":""2005"", ""Пробег"":""140 000 - 149 999"", ""Тип кузова"":""Седан"", ""Цвет"":""Синий"", ""Объём двигателя"":""1.5"", ""Коробка передач"":""Механическая"", ""Тип двигателя"":""Бензиновый"", ""Привод"":""Передний"", ""Руль"":""Левый"", ""Состояние"":""Не битый""}
    price is_proved is_blocked phones_cnt emails_cnt urls_cnt close_hours
1: 150000        NA          0          0          0        0        0.03
2:      0        NA          0          1          0        0       22.38
3:   1500        NA          0          0          0        0        0.41
4: 365000        NA          0          0          0        0        8.87
5:   5000        NA          0          0          0        0       11.82
6:      0        NA          0          1          0        0       22.55


> tail(DT)
     itemid            category                subcategory                                              title
1: 99999929     Для дома и дачи     Ремонт и строительство             Алюминиевые раздвижки профиль проведал
2: 99999962           Транспорт      Запчасти и аксессуары Bridgestone-Blizzak WS-60-225/50 R17-зима-комплект
3: 99999973        Недвижимость                   Квартиры                                1-к квартира, 39 м²
4: 99999974              Услуги          Предложения услуг                 Ремонт, отделочные работы под ключ
5: 99999977 Бытовая электроника              Аудио и видео                                     Nokia оригинал
6: 99999982         Личные вещи Товары для детей и игрушки                          Продам мобиль на кроватку
                                                                                                                                                                                                                                                    description
1: 2 одинаковых балкона размер 1560(ширина)*1050(высота) по две секции , на 2 полозных рамах,белые,новые.В комплекте есть зацепы и язычки для замков.Баконы абсолютно новые(ошиблись в размере,не устанавливались)Цена 4000 одна конструкция,две отдам за 7000.
2:                                                                                                                  Комплект 4 шины. Протектор 5-6 мм,равномерный износ. ^p Стоимость комплекта 16 000 рублей ^p Дополнительные номера телефонов ^p 8-@@PHONE@@
3:                                                                                                                                                                                                                                 пустая.после ремонта.риэлтор
4:                                                            Отделочные работы. Комплексный ремонт квартир, домов. ^p - выравнивание стен, потолков ^p - гипсокартон ^p - устройство откосов ^p - шпаклёвка ^p - окраска водоимульсионными составами ^p - обои
5:                                                                                                                                                                                                                                         в отличном состоянии
6:                                                                                                                 Механический.В отличном состоянии.Также могу отдать крепеж,но он переломлен пополам,но там вполне можно склеить клеем и прослужит еще(фото).
                                                                                                                          attrs price is_proved is_blocked phones_cnt emails_cnt urls_cnt close_hours
1:                                                                                          {""Вид товара"":""Окна и балконы""}  4000        NA          0          0          0        0        0.69
2:                                                           {""Вид товара"":""Шины, диски и колёса"", ""Тип товара"":""Шины""} 16000        NA          0          1          0        0        0.04
3: {""Тип объявления"":""Сдам"", ""Количество комнат"":""1"", ""Срок аренды"":""На длительный срок"", ""Адрес"":""Автовокзал""} 11000        NA          0          0          0        0        0.20
4:                                                                                   {""Вид услуги"":""Ремонт, строительство""}     0        NA          0          0          0        0       23.50
5:                                                                                                {""Вид товара"":""Наушники""}   300        NA          0          0          0        0        5.72
6:                                                                                                 {""Вид товара"":""Игрушки""}   300        NA          0          0          0        0       19.08


> dim(DT)
[1] 3995803      13


$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000      # i.e. my slow netbook (4GB RAM)
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1
Community
Matt Dowle
  • How did you download it and use it from GitHub? The way I download and use it now is from CRAN like `install.packages('data.table')` and `require(data.table)`. Is there an example available for the way you did it? – user1477388 Jun 26 '14 at 18:21
  • @user1477388 On the [GitHub page](https://github.com/Rdatatable/data.table/) there are instructions. Scroll down a bit to the README. It's one command. – Matt Dowle Jun 26 '14 at 19:26
  • Hmm, thanks but the compilation keeps failing when I try to run that command: `Warning: running command 'make -f "Makevars" -f "C:/PROGRA~1/R/R-31~1.0/etc/i386/Makeconf" -f "C:/PROGRA~1/R/R-31~1.0/share/make/winshlib.mk" SHLIB="data.table.dll" OBJECTS="assign.o bmerge.o chmatch.o dogroups.o fastmean.o fastradixdouble.o fastradixint.o fcast.o fmelt.o forder.o fread.o gsumm.o init.o rbindlist.o reorder.o uniqlist.o vecseq.o wrappers.o"' had status 127 ERROR: compilation failed for package 'data.table'` – user1477388 Jun 26 '14 at 19:35
  • @user1477388 Sorry, haven't seen that before. I don't have Windows. You'll need to search and then ask a new question about `install_github` on Windows. Have you read its documentation and installed Rtools? I doubt it's anything specific to `data.table` since other Windows users are reporting success with installing `data.table` on Windows from GitHub. – Matt Dowle Jun 26 '14 at 19:46
  • @user1477388 For example here: http://stackoverflow.com/questions/24375832/fread-and-column-with-a-trailing-backslash/24377685#comment37731803_24377685. It's not a similar lock issue on .dll is it? – Matt Dowle Jun 26 '14 at 19:52
  • I read that and another post after googling which also indicates that I just need to start a new R session. I believe I do this by restarting the RStudio application itself, no? I have tried that but I still get the same error. – user1477388 Jun 26 '14 at 19:57
  • Quick update: I cleared the R session by going to Session -> Clear Workspace and Session -> Restart R but after running the command `devtools::install_github("data.table", "Rdatatable")`, I get the same error. – user1477388 Jun 26 '14 at 20:18
  • @user1477388 Have you installed [Rtools](http://cran.r-project.org/bin/windows/Rtools/) as I asked above? – Matt Dowle Jun 26 '14 at 20:32
  • Yep, it seems the problem was with the older version installed by default. Thank you Matt, I updated 'data.table' and it works perfectly. – Darius Jun 27 '14 at 07:19
  • @MattDowle I installed it. During the installation it asked me for an "additional task"; I didn't know what it was so I didn't check the box. Anyway, after installing and running the `devtools::` command, I get the same error as before. – user1477388 Jun 27 '14 at 12:52
  • @user1477388 Please ask a new question and be uber specific. We need to know exactly what you're doing. Good advice [here](http://stackoverflow.com/help/how-to-ask). It doesn't sound to me like a problem to do with `data.table` per se. Related to RStudio possibly, your install of Rtools, paths etc. – Matt Dowle Jun 27 '14 at 13:06
  • @MattDowle Ok, I have created a new question with screenshots here http://stackoverflow.com/questions/24452934/package-installation-failed-when-trying-to-install-from-github – user1477388 Jun 27 '14 at 13:23
  • @MattDowle By the way, I was finally able to use `data.table` with great success. However, the problem is now that I don't get the Russian text that you get when you call `head(DT)` above. I get strangely encoded text as shown in my question http://stackoverflow.com/questions/24437479/rstudio-output-of-russian-characters I don't know if you can help but I just thought I'd ask. Thanks, anyway. – user1477388 Jun 27 '14 at 20:22
  • @user1477388 Can you upgrade to Linux or Mac? – Matt Dowle Jun 27 '14 at 22:15
  • @MattDowle :) I like how you worded that. Unfortunately, I cannot as I am quite into Windows development, being primarily a .NET programmer. – user1477388 Jun 28 '14 at 02:02
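
On the failed Windows build discussed in the comments above: R reports status 127 when a command could not be run at all, which is consistent with `make` (part of Rtools) being missing or not on the PATH. A small base R check is sketched below. `have_tool` is a hypothetical helper name, and the commented install line uses the `owner/repo` form accepted by current devtools, which differs from the two-argument call quoted in the comments.

```r
# Hypothetical helper: TRUE if an executable with this name is found on the PATH.
have_tool <- function(name) {
  nzchar(unname(Sys.which(name)))
}

# Tools a source install of a package with C code typically needs.
tools <- c("make", "gcc")
status <- vapply(tools, have_tool, logical(1))
print(status)  # FALSE entries point at a missing or mis-configured toolchain

# Only attempt the source install once the toolchain is visible (not run here):
# if (all(status)) devtools::install_github("Rdatatable/data.table")
```

If `make` shows as FALSE right after installing Rtools, the PATH seen by the running R session is the first thing to check, and restarting R or RStudio is usually needed for a changed PATH to be picked up.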