2

I apologize if this seems simple, but I have spent an absurd amount of time chasing Google searches in circles and I can't seem to find an answer to this problem...

I have a series of excel files (of .csv type) that I need to import. They are composed of data consisting of 9 columns separated by |. Looks something like this:

1  CustID|Animal Adopted|Description|Year|Adoption date|dob|Fee|Paid        
2  123|Horse|I got a Horse|2011|2011-01-31|1991-03-01|8.00|NO       
3 0101|Horse|I got a Horse|2006|2006-01-31|1987-21-01|50.00|NO      
4 1929|Fish|I adopted a fish    and for some reason this requires a very long description |2001|2001-12-31|1991-10-01|5.00|YES
5 199|Fish|I adopted a fish and for some reason this requires a very long description |2001|2001-12-31|1991-10-01|5.00|YES
6 0101|Dog|another really long description  for adopting a dog because they are awesome 
7 |2003|2003-05-11|1999-04-01|50.00|(null)      
8 0102|Dog|another really long description  for adopting a dog because they are awesome 
9 |2003|2003-05-11|1989-04-01|51.00|(null)      
10 9330|Cat|Mr.Kitty is a kitty who is so fat   and cute    and furry|2010|2010-10-11|1986-10-20|10000.00|YES
11 1933|Hamster|Doing a dance   in tube city.   
12 |2011|2011-12-05|2005-06-01|4.00|(null)  

Thus when imported into R it does this

    CustID Animal.Adopted                                                                  Description       Year Adoption.date        dob   Fee Paid..
1     123          Horse                                                                I got a Horse       2011    2011-01-31 1991-03-01     8   NO,,
2     101          Horse                                                                I got a Horse       2006    2006-01-31 1987-21-01    50   NO,,
3    1929           Fish  I adopted a fish,and for some reason,this requires a very long description        2001    2001-12-31 1991-10-01     5    YES
4     199           Fish  I adopted a fish,and for some reason,this requires a very long description        2001    2001-12-31 1991-10-01     5    YES
5     101            Dog another really long description,for adopting a dog because they are awesome,                                        NA       
6      NA           2003                                                                   2003-05-11 1999-04-01         50.00   (null),,    NA       
7     102            Dog another really long description,for adopting a dog because they are awesome,                                        NA       
8      NA           2003                                                                   2003-05-11 1989-04-01         51.00   (null),,    NA       
9    9330            Cat                         Mr.Kitty is a kitty who is so fat,and cute,and furry       2010    2010-10-11 1986-10-20 10000    YES
10   1933        Hamster                                                 Doing a dance,in tube city.,                                        NA       
11     NA           2011                                                                   2011-12-05 2005-06-01          4.00   (null),,    NA       
> 

This severely impacts the accuracy of my results later when I try to categorize my data. Lines 5 & 6, 7 & 8, and 10 & 11 represent the problem, as half of the column data is on the first row and the other half is in the second.

I have found this question @Layla from a couple years ago regarding reading in text files line by line and am attempting to blend the accepted answer with a personalized for loop. However, I don't understand why the while loop condition is TRUE...

I was wondering if there was a way to code it so that every time r encounters the row with only three columns, it bumps up the next row to fill in the last two columns. But I would only want that to occur with perfectly matching data because I may be missing random names or locations throughout the data as I don't want to mismatch the entire data set as previously mentioned.

I don't know how to instruct the loop to look out for the missing data. I tried something like if line(ncol = 3) {Merge()} but lets be honest, that makes no sense...

I've tried readLines() ... but that is still reading everything in the exact same way so I figure this needs to be combine with other code. And all of my Google searches just turn up the same basic "readLines reads the connection in one line at a time" and then that's it.

I feel like I have the pieces but I still can't figure out how to put them together.

As a caveat, I am quite new to this. Even a helpful article that explains how to write a routine that reads in code manually would be helpful. There is so much information on how to read in files using read.csv, read.tsv, read.xlsx, etc. but very very little on how to handle reading in data manually.

rstudent
  • 21
  • 2
  • 1
    You'll probably have to read in the lines and process this manually. – Roman Luštrik Jul 09 '17 at 21:11
  • Create a code that reads the data in line by line? I'm good with doing that but I have no idea how and Google is producing no help – rstudent Jul 09 '17 at 21:13
  • 1
    Use `readLines` in `base R` or `read_lines` in `readr` package – CPak Jul 09 '17 at 23:23
  • Why is this package so under notated? I'm not sure how this works, so I'm just running my file through it but my entries are still processing into separate rows. – rstudent Jul 10 '17 at 00:26
  • Referring to the answer from @oddHypothesis, you can get the header, scan through the text files, output a vector, and transform the output vector to matrix. https://stackoverflow.com/questions/2040026/how-to-import-data-with-line-breaks-from-text-file-into-r – raymkchow Jul 10 '17 at 01:48
  • @raymkchow the issue with this is that `scan()` separated out each individual word and scrambled the entire data frame. I think I need to combine readLines with a loop or if statements – rstudent Jul 10 '17 at 02:01
  • Agree that scan would not be a good choice, and readLines would be better, but would suggest you further look at the function `count.fields` with the use of `sep="|"`. The other way to proceed would be to make a tibble but do the read operation with all columns as "character". You would help your own cause by producing an adequate example of the top of the raw csv file rather than the imprecise description at present. – IRTFM Jul 10 '17 at 17:17
  • I appreciate the comment @42. I have updated the question and content to the best of my ability to create a more accurate representation of the situation. – rstudent Jul 10 '17 at 18:53
  • Does your raw data really have the row numbers? If so, how are you currently reading it in to separate those from the first column? – Aaron left Stack Overflow Jul 10 '17 at 19:41
  • @Aaron no I apologize. Those are the lines in the excel file. That was a misunderstanding on my part. – rstudent Jul 10 '17 at 20:06
  • Wait, what? No, the csv file doesn't have the row numbers? Or yes, "those are [exactly] the lines in the [csv] file"? – Aaron left Stack Overflow Jul 10 '17 at 20:13
  • No, the csv file does not have row numbers. I mistakenly thought that raw data meant producing an example that replicated what the data looked like in the document. So I edited the row numbers into my example above to match the spreadsheet but those are not included in the lines in the file. – rstudent Jul 10 '17 at 20:22
  • Thanks, that's what I guessed. As perhaps you're discovering, Excel is not a great tool for working with csv files and should not be considered the "document". – Aaron left Stack Overflow Jul 10 '17 at 20:52

2 Answers2

1

I found the answer to "How to import data with line breaks from text file into R?" helpful. But first I thought it might be helpful to change the "|" separators to commas.

txt <- "1  CustID|Animal Adopted|Description|Year|Adoption date|dob|Fee|Paid        
2  123|Horse|I got a Horse|2011|2011-01-31|1991-03-01|8.00|NO       
3 0101|Horse|I got a Horse|2006|2006-01-31|1987-21-01|50.00|NO      
4 1929|Fish|I adopted a fish    and for some reason this requires a very long description |2001|2001-12-31|1991-10-01|5.00|YES
5 199|Fish|I adopted a fish and for some reason this requires a very long description |2001|2001-12-31|1991-10-01|5.00|YES
6 0101|Dog|another really long description  for adopting a dog because they are awesome 
7 |2003|2003-05-11|1999-04-01|50.00|(null)      
8 0102|Dog|another really long description  for adopting a dog because they are awesome 
9 |2003|2003-05-11|1989-04-01|51.00|(null)      
10 9330|Cat|Mr.Kitty is a kitty who is so fat   and cute    and furry|2010|2010-10-11|1986-10-20|10000.00|YES
11 1933|Hamster|Doing a dance   in tube city.   
12 |2011|2011-12-05|2005-06-01|4.00|(null)
"

Edit:

I realized that considering this as text and replacing only the EOL characters that were followed by "nn<spc>|" with

 txt2 <- gsub("\\n\\d{1,4} {1,3}\\|", "|", txt)

Gets a regular result:

 count.fields( textConnection(txt2),sep="|")
[1] 8 8 8 8 8 8 8 8 8

Now read the rest of the file with read.table.

> read.table( text=txt2,sep="|", header=TRUE)
  X1..CustID Animal.Adopted
1     2  123          Horse
2     3 0101          Horse
3     4 1929           Fish
4      5 199           Fish
5     6 0101            Dog
6     8 0102            Dog
7    10 9330            Cat
8    11 1933        Hamster
                                                                     Description Year Adoption.date
1                                                                  I got a Horse 2011    2011-01-31
2                                                                  I got a Horse 2006    2006-01-31
3 I adopted a fish    and for some reason this requires a very long description  2001    2001-12-31
4    I adopted a fish and for some reason this requires a very long description  2001    2001-12-31
5  another really long description  for adopting a dog because they are awesome  2003    2003-05-11
6  another really long description  for adopting a dog because they are awesome  2003    2003-05-11
7                      Mr.Kitty is a kitty who is so fat   and cute    and furry 2010    2010-10-11
8                                               Doing a dance   in tube city.    2011    2011-12-05
         dob   Fee         Paid
1 1991-03-01     8    NO       
2 1987-21-01    50     NO      
3 1991-10-01     5          YES
4 1991-10-01     5          YES
5 1999-04-01    50 (null)      
6 1989-04-01    51 (null)      
7 1986-10-20 10000          YES
8 2005-06-01     4       (null)
> 
Community
  • 1
  • 1
IRTFM
  • 258,963
  • 21
  • 364
  • 487
1

Here I assume that the row number is not actually part of the data set, as you are reading that in correctly already, so I gsub it out first. See end of answer for my data input.

I notice that the second of the two rows always starts with the separator |; that makes it easy to remove the newline before it, and then it can be read in normally. Other options I tried before realizing that are below.

> txt3 <- gsub("\n|", "|", txt, fixed=TRUE)
> read.table(text=txt3, sep="|", header=TRUE)
  CustID Animal.Adopted                                                                    Description Year Adoption.date        dob   Fee   Paid
1    123          Horse                                                                  I got a Horse 2011    2011-01-31 1991-03-01     8     NO
2    101          Horse                                                                  I got a Horse 2006    2006-01-31 1987-21-01    50     NO
3   1929           Fish I adopted a fish    and for some reason this requires a very long description  2001    2001-12-31 1991-10-01     5    YES
4    199           Fish    I adopted a fish and for some reason this requires a very long description  2001    2001-12-31 1991-10-01     5    YES
5    101            Dog  another really long description  for adopting a dog because they are awesome  2003    2003-05-11 1999-04-01    50 (null)
6    102            Dog  another really long description  for adopting a dog because they are awesome  2003    2003-05-11 1989-04-01    51 (null)
7   9330            Cat                      Mr.Kitty is a kitty who is so fat   and cute    and furry 2010    2010-10-11 1986-10-20 10000    YES
8   1933        Hamster                                               Doing a dance   in tube city.    2011    2011-12-05 2005-06-01     4 (null)

Because the second of the two rows always starts with the separator |, the CustID column is NA when you read it in. You can use this to move those rows earlier. Also note that now all the fields are character, not numeric, because when originally read in, there were characters in those fields; you'll have to convert these after reading.

d <- read.table(text=txt, sep="|", fill=NA, header=TRUE, as.is=TRUE)
k <- which(is.na(d$CustID))
d[k-1,4:8] <- d[k,2:6]
d[-k,]

Alternately, if the break between rows might not always be at the same column, we can read it in, split into units, and then make the matrix.

txt2 <- txt
txt2 <- gsub("\n", "|", txt2, fixed=TRUE)
txt2 <- gsub("||", "|", txt2, fixed=TRUE)
txt2 <- strsplit(txt2, "|", fixed=TRUE)[[1]]
d2 <- matrix(txt2, ncol=8, byrow=TRUE)
colnames(d2) <- d2[1,]
d2 <- as.data.frame(d2[-1,], as.is=TRUE)

data input:

txt <- "1  CustID|Animal Adopted|Description|Year|Adoption date|dob|Fee|Paid        
2  123|Horse|I got a Horse|2011|2011-01-31|1991-03-01|8.00|NO
3 0101|Horse|I got a Horse|2006|2006-01-31|1987-21-01|50.00|NO
4 1929|Fish|I adopted a fish    and for some reason this requires a very long description |2001|2001-12-31|1991-10-01|5.00|YES
5 199|Fish|I adopted a fish and for some reason this requires a very long description |2001|2001-12-31|1991-10-01|5.00|YES
6 0101|Dog|another really long description  for adopting a dog because they are awesome 
7 |2003|2003-05-11|1999-04-01|50.00|(null)
8 0102|Dog|another really long description  for adopting a dog because they are awesome 
9 |2003|2003-05-11|1989-04-01|51.00|(null)
10 9330|Cat|Mr.Kitty is a kitty who is so fat   and cute    and furry|2010|2010-10-11|1986-10-20|10000.00|YES
11 1933|Hamster|Doing a dance   in tube city.   
12 |2011|2011-12-05|2005-06-01|4.00|(null)
"
txt <- sub("1 *", "", txt)
txt <- gsub("\n[0-9]+ *", "\n", txt)
Aaron left Stack Overflow
  • 36,704
  • 7
  • 77
  • 142
  • Your answer "more complete" in the sense that you removed the leading line numbers. Upvoted. – IRTFM Jul 10 '17 at 22:39