reading a text file with no particular format in R

Question

I would like to read a text file which has variable number of columns in R. My text file looks like follows. I would like to read into a data frame where I can just access the individual lines as df1 for the first line, df[2] for the second line and so on.

a       09
abandon*        12      16      19      20      24
abort*  60      61      62
about   10
above   10      41      42
abrupt* 37
absolut*        26
abuse*  12      16      18
abusive 12      16      18
academi*        47      48
accept  08      12      13      15      20      22      39
accepta*        08      12      13      15
accepted        08      12      13      15      20      38

I have tried the following

read.table("myfile",header=T,sep=" ")

but this inserts tab characters.

I have also tried readLines

singleString = readLines("myfile")

but this too inserts \t characters.

EDIT : (Thanks to Dominic Comtois for helping thus far, I have got it to work but still don't know what's going wrong)

Initially the words on the left and the numbers on the right were separate by tabs, not spaces. And so R wasn't able to read it correctly

df = read.table('filename',sep="|") df$V1 = as.character(df$V1) df1,"V1" "a\t09"
So I replaced tabs with spaces in vi editor using the command :1,$s/\t/ /g
I was then able to read it in R using df = read.table('filename',sep="|") but certain lines looked like this (Surprisingly, when I read the file now, a lot of the apostrophes are gone. The first word was spelt as aren't, not it's arent)

df123, "arent 07 39\nargu 12 16 18 31 32\narm 60 61\narms 60 61\narmy 31\naround 10 41\narous 12 60 61\narrange 20\narrive 39 46\narrived 46\narrives 39 46\narriving 46\narrogan 12 16 18\narse 60 61 66\narses 66\narsehole 66\narter 60 61\narthr 60 61\nas 10\nasham 12 16\nashes 57 59\nask 27 29 31 32 39\nasked 27 29 31 32 38\nasking 27 29 31 32\nasks 27 29 31 32 39\nasleep 60 64\nass 60 61 66\nassault 12 16 18\nassembl 31\nasses 66\nasshole 66\nassociatio 47 49\nassum 20 21\nassur 12 13 15 26\nasthma 60 61\nat 10\nate 27 38 60 63\natop 10 41 42\nattachment 12 13 14\nattract 12 13\nauditorium 47 48\naugust 37\naunt 31 35\nautumn 37\naversi 12 16 17\navoid 12 16 20 24\nawake 60 64\naward 12 13 15 47 50\naware 20 22\naway 10\nawesome 12 13\nawful 12 16\nbabe 31 36\nbabies 31 36\nbaby 31 36\nbad 12 16\nband 31 51 55\nbank 56\nbar 60 63\nbarrier 20 24\nbars 60 63\nbaseball 47 48 51 53\nbases 20 21\nbasis 20 21\nbasketball 47 48 51 53\nbastard 12 16 18 66\nbath 51 52 60 65\nbe 40\nbeaten 12 16 18 47 50\nbeaut 12 13\nbecame 20 22 38\nbecause 20 21\nbecome 20 39\nbecomes 20 39\nbecoming 20\nbed 51 52 64\nbeen 38\nbeer 60 63\nbefore 10 37\nbeg 31 32 39\nbegan 37 38\nbegged 31 32 38\nbegging 31 32\nbegin 37 39\nbeginn 37\nbegins 37 39\nbegs 31 32 39\nbelieve 20 22 39\nbelieved 20 22 38\nbelieves 20 22 39\nbelieving 20 22\nbelly 60 61\nbelow 10 41 43\nbeneath 10 41 43\nbenefit 12 13 39\nbenefits 12 13 47 49\nbenign 12 13\nbeside 10 41\nbesides 45\nbest 12 13 15 47 50\nbet 25 39 56\nbets 25 39 56\nbetter 12 13 47 50\nbetting 25 56\nbetween 10 41\nbewilder 12 16 17\nbi 60 62\nbible 57 58\nbicyc 51 53\nbig 41\nbillion 11\nbinge 60 61 63\nbiology 47 48\nbitch 12 16 18 66\nbitter 12 16 18 27\nbladder 60 61\nblam 12 16 18 31 32\nbleed 60 61\nbless 12 13 57 58\nblock 20 24\nblood 60 61\nboard 47 49\nboarder 41\nbodi 60 61\nbody 60 61\nbold 12 13 15\nbone 60 61\nbonus 47 49\nboobs 60 61 62 66\nbook 47 48\nbore 12 16\nboring 12 16\nborrow 56\nboss 47 49\nbother 12 16\nbottom 41 43\nbought 38 56\nbowel 60 61\nboy 31 36\nboy 31 36"
So I wrote these to a new file as

write.table(df[grep("\n",df$V1),"V1"],'newlines')
But since we are writing so many sets of lines, it puts "" after every set. So I searched and replaced the " characters with empty, essentially removing them
I then opened them using the original commamnd and it worked, everything was in it's separate line

df = read.table('newlines',sep="|") df$V1 = as.character(df$V1)

I also opened the file after replacing tabs with spaces in a hex editor and did not see anything peculiar. This is the part from one line before where the problem starts

area  41
aren't  07  39
argu  12  16  18  31  32
arm  60  61

Corresponding hex

61 72 65 61 20 20 34 31 0A 61 72 65 6E 27 74 20 20 30 37 20 20 33 39 0A 61 72 67 75 20 20 31 32 20 20 31 36 20 20 31 38 20 20 33 31 20 20 33 32 0A 61 72 6D 20 20 36 30 20 20 36 31

If anyone would like to access the file, it can be found at http://aftabubuntu.cloudapp.net/LIWC2001_English.dic

Dominic Comtois · Accepted Answer · 2015-04-10T07:04:58.043

2

If you don't want to have the numbers considered as different "cells" or "fields", you can set sep as a character that is nowhere in your source file.

For instance:

df1 <- read.table("myfile",sep="|")

As for header=TRUE, this should be used only if your first line contains the names of your columns. If it's not the case, don't put it in. To skip the first line instead, just use skip=1.

Then you'll be able to access the individual lines with

df1[1,] # for first line
df1[2,] # for second line
        # and so on ...

edited Apr 10 '15 at 07:04

answered Apr 10 '15 at 06:51

Dominic Comtois

10,230
1
39
61

It does not seem to be working for me yet. I still get tab characters, can you have a look below. > df <- read.table("myfile",sep="|") > df$V1 =as.character(df$V1) > df[[1]][1] [1] "a\t09" > df[[1]][2] [1] "abandon\t12\t16\t19\t20\t24" > df[[1]][3] [1] "abort\t60\t61\t62" > df[[1]][4] [1] "about\t10" – tubby Apr 10 '15 at 07:25
1

Oh well that only indicates that the tab characters are in your input file! Open it with a text editor and replace tabs by spaces... or try using `sep="\t"` to have your dataframe well organized, if number of tabs are consistent. Or try with `sep="\t"` and `fill=TRUE`. – Dominic Comtois Apr 10 '15 at 07:28
Thanks @Dominic Comtois. I used set tabstop=2 shiftwidth=2 expandtab to replaces tabs with spaces, and then it worked. But we're using sep="|" as delimiter. So was the problem because read.table expects a space as the delimiter? – tubby Apr 10 '15 at 07:39
1

When you say "it worked", you mean you have a dataframe with only 1 column I presume? Yes, from `?read.table` : _If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns._ – Dominic Comtois Apr 10 '15 at 07:43
I notice that one of my lines looks like "heres 39 41 44\nhero 12 13 15\nhers 01 06 31 33\nherself 01 06 31 33\nhesitant 12 16 17 20 24 25\nhesitat 20 24 25 If I remove all the newlines, it of course doesn't work because of read.table. How do I get rid of these? – tubby Apr 10 '15 at 11:44
This is odd... maybe you have 2 types of line feed in your document? (Linux/Windows)? You could try copying and pasting the content in a new text document and try again. – Dominic Comtois Apr 11 '15 at 01:46
, I have made an edit to the question. I've got it to work, but not sure what the problem is. Please have a look if you get some time. – tubby Apr 11 '15 at 04:55
Try adding `quote=""`. I got good results with `read.table("http://aftabubuntu.cloudapp.net/LIWC2001_English.dic", header=FALSE, sep = "|", quote = "")` – Dominic Comtois Apr 11 '15 at 13:41
This is great. Thanks. I also read up about quote and about R's behavior to expect a closing quote for every single quote it sees, and consider everything between the two to be one line. It also clearly explains why my string was going wrong at the first word containing a single quote like aren't and getting back on track at the next single quoted word only to go wrong again at the next single quoted word. If you could add this an answer, I could accept it. – tubby Apr 11 '15 at 21:12

reading a text file with no particular format in R

1 Answers1