
Does anyone know what type of character this is? I'm trying to import this file into my R session. The columns are tab separated (actually, it looks to me as if some are separated by one tab and others by two), but when I open this plain text file in my editor it shows these "up arrows".

Here is a portion of the file (screenshot):

The file is 184.686 kB.

My code in a freshly started R session is:

> library(tidyverse)
-- Attaching packages -------------------------------- tidyverse 1.3.1 --
√ ggplot2 3.3.5     √ purrr   0.3.4
√ tibble  3.1.6     √ dplyr   1.0.8
√ tidyr   1.2.0     √ stringr 1.4.0
√ readr   2.1.2     √ forcats 0.5.1
-- Conflicts ----------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

> path_T11 <- "Einzelteil/Einzelteil_t11.txt"
> T11 <- read_tsv(path_T11)
Error: The size of the connection buffer (131072) was not large enough
to fit a complete line:
  * Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`

# Sys.setenv("VROOM_CONNECTION_SIZE" = 180000000) still generates error so i go up with
> Sys.setenv("VROOM_CONNECTION_SIZE" = 190000000)
> T11 <- read_tsv(path_T11)
New names:                                                                                                             
* `1` -> `1...10`
* `215` -> `215...12`
* `2154` -> `2154...13`
* `0` -> `0...14`
* `NA` -> NA...15
* ...
Rows: 0 Columns: 21467349
-- Column specification -------------------------------------------------------------------------------------------------
Delimiter: "\t"
chr (21467349): X1, ID_T11, Herstellernummer, Werksnummer, Fehlerhaft, Fehlerhaft_Datum, Fehlerhaft_Fahrleistung, Pro...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
> spec(T11)
#runs too long without giving any output and i stopped my R Session

After the file was wrongly "imported", I noticed that my R session was using 16 GB of memory:


I am new to RStudio and R. I used read_tsv() because I thought the file was tab separated. I don't know whether this problem (it says there are 0 rows) is due to the "up arrow" or to the tab delimiter. I would be grateful if someone could give me an in-depth overview of what is happening and why the import was not successful. I would also like to know why my R session is taking up so much memory.

EDIT: It seems to me that columns are tab separated and rows are "up-arrow" separated. I say this because of how the file looks in TextEdit on my MacBook (screenshot).

On Windows: screenshots of the beginning and of the end of the file.
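One way to check the "rows are up-arrow separated" idea without relying on how an editor draws the character is to count control bytes directly. The following is only a sketch: `count_control_bytes` is a hypothetical helper name, and the tiny temporary file is made-up stand-in data, not the real Einzelteil_t11.txt.

```r
# Hypothetical helper: count a few control bytes in the first `n` bytes of a file.
count_control_bytes <- function(path, n = 1e6) {
  bytes <- readBin(path, what = "raw", n = n)
  c(
    tab             = sum(bytes == as.raw(0x09)),
    line_feed       = sum(bytes == as.raw(0x0a)),
    form_feed       = sum(bytes == as.raw(0x0c)),
    carriage_return = sum(bytes == as.raw(0x0d))
  )
}

# Made-up stand-in file so the sketch runs on its own.
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("ID\tWerk\f1\t215\f2\t2154\f"), demo_path)

counts <- count_control_bytes(demo_path)
print(counts)
```

If the real file shows zero line feeds and carriage returns but many occurrences of some other control byte, that byte is probably acting as the row separator.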

EDIT: Using the UTF-8 Tool mentioned in the comments to identify the "up-arrow" character results in this (screenshot of the UTF-8 Tool output). Thank you in advance!
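Assuming the character really is a form feed (byte 0x0C, which terminals show as ^L), the "0 rows, millions of columns" result would follow from how lines are detected. Here is a minimal illustration with made-up data, using plain string splitting rather than readr itself:

```r
# Made-up miniature of the file: fields separated by tabs, records ended
# by form feed ("\f") instead of newline ("\n").
sample_text <- "ID\tWerk\f1\t215\f2\t2154\f"

# A reader that only breaks lines at "\n" sees a single line ...
lines_seen <- strsplit(sample_text, "\n", fixed = TRUE)[[1]]
length(lines_seen)

# ... and that one line, taken as the header, splits into many "columns".
header_fields <- strsplit(lines_seen[1], "\t", fixed = TRUE)[[1]]
length(header_fields)

# Splitting at "\f" first recovers the intended records.
records <- strsplit(sample_text, "\f", fixed = TRUE)[[1]]
length(records)
```

On the real file, the single "header line" would contain every field of every record. That would match the 21467349 columns and 0 rows reported above, and it may also be why the connection buffer had to be so large and why memory use grew so much.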

Bonito
  • Impossible to answer without knowing your data. So what does it look like? Is it indeed tab delimited? How many rows/columns should there be in your file? What are the column names of your data? Although a screenshot is usually not a good way to share your data, it would help here to understand how your file looks. – deschen Feb 24 '22 at 13:45
  • @deschen How can I give you an overview, then? It's a huge file. Column names are "X1" "ID_T11" "Herstellernummer" "Werksnummer" "Fehlerhaft" "Fehlerhaft_Datum" "Fehlerhaft_Fahrleistung" "Produktionsdatum_Origin_01011970" "origin". To me it looks as if columns are tab separated and rows are "up arrow" separated. I don't know of a way to specify the row delimiter in any of the readr functions. – Bonito Feb 24 '22 at 13:50
  • Open the file outside of R, e.g. in a usual text editor for example (or even browser) and then make a screenshot that shows the column names and first 20 or so rows. – deschen Feb 24 '22 at 13:51
  • There are lots of new columns. Maybe you have set the wrong line feed. This happens, e.g., when the file was created on a Linux system and is read on a Windows one. – danlooo Feb 24 '22 at 13:51
  • Also, you should know how many columns/rows this file has - please share this info. In addition, if possible, create a small subset of your data with e.g. only the first 20 rows, and then share the file with us, e.g. on Google Drive or so. – deschen Feb 24 '22 at 13:53
  • Sorry, I just saw that you already shared a screenshot. The upwards arrows could indeed be a problem. – deschen Feb 24 '22 at 13:55
  • @deschen Windows detects 1 row and 1891183 columns (but the number does not seem fully readable) – Bonito Feb 24 '22 at 14:07
  • What do you mean with „windows detects“? – deschen Feb 24 '22 at 14:15
  • Use a command line tool [like this](https://stackoverflow.com/q/60034/903061) to replace the up arrows with common linebreaks `"\n"` in the file. Then it should read in just fine. – Gregor Thomas Feb 24 '22 at 14:38
  • @deschen I have now edited the thread with two screenshots of the beginning and the end of the file on Windows. In the bottom right corner you can see what I mean by "windows detects". – Bonito Feb 24 '22 at 17:35
  • @danloo I think that could be the problem. I edited the thread and posted a picture of how the file is displayed by TextEdit on my MacBook. – Bonito Feb 24 '22 at 17:36
  • @GregorThomas Thank you for the idea. What do you think the "up-arrow" corresponds to in Windows? Searching for it, I found that it could be Alt+24, but this produces a different "up-arrow". – Bonito Feb 24 '22 at 17:41
  • You could probably just copy/paste it from your file to the replacement command. Or I googled "UTF 8 character identifier" and [found this](https://www.cogsci.ed.ac.uk/~richard/utf-8.html), see if that can tell you what it is. – Gregor Thomas Feb 24 '22 at 17:49
  • @GregorThomas Using the tool, I found that it is a "Form Feed (FF)", but after googling it, it doesn't look the same. See the new edit on the post for a screenshot. – Bonito Feb 24 '22 at 18:01
  • @GregorThomas I tried running this in PowerShell: `(Get-Content Einzelteil_T11.txt) -replace '\^L','\n' | Out-File -encoding ASCII T11.txt`. I used `^L` because that is how the character is shown when I copy and paste it on the command line. It created a new file T11.txt, but this didn't change anything. Maybe I need to escape special characters, but I don't know how to do it correctly. – Bonito Feb 24 '22 at 18:14
  • OK, thanks. So the question is: when and how do the upward arrows make it into the file? Could you just open the file on your Mac with Excel and save it as xlsx or csv? – deschen Feb 24 '22 at 18:55
  • Also, not sure if I counted wrong, but it seems there are more columns than column names, i.e. I count 9 column names, but 10 columns. – deschen Feb 24 '22 at 18:57
  • @deschen Opening it with Excel results in "file not loaded correctly". – Bonito Feb 24 '22 at 19:05
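The replacement suggested in the comments could also be done from within R at the byte level, which avoids having to type or escape the control character. This is a sketch under the assumption that the row separator is the form feed byte (0x0C). `convert_form_feeds` is a hypothetical helper name and the demo file is made up. As a side note, the PowerShell pattern `'\^L'` matches a literal caret followed by `L`, not the control character, which may be why that attempt changed nothing.

```r
# Hypothetical helper: copy a file, turning every form feed byte (0x0C)
# into a line feed byte (0x0A). Reads the whole file into memory.
convert_form_feeds <- function(in_path, out_path) {
  bytes <- readBin(in_path, what = "raw", n = file.size(in_path))
  bytes[bytes == as.raw(0x0c)] <- as.raw(0x0a)
  writeBin(bytes, out_path)
  invisible(out_path)
}

# Made-up stand-in for the real file so the sketch runs on its own.
demo_in  <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeBin(charToRaw("ID\tWerk\f1\t215\f2\t2154\f"), demo_in)

convert_form_feeds(demo_in, demo_out)

# Base R reader used here to keep the sketch dependency-free.
fixed <- read.delim(demo_out, sep = "\t", header = TRUE)
dim(fixed)
```

The converted file could then be passed to read_tsv() as in the question. Whether the remaining issues (one versus two tabs, 9 names for 10 columns) go away would still have to be checked on the real data.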

0 Answers