-1

pretty new to R and i've come across a weird CSV my boss needs me to simplify for use.

I have a .csv that has multiple sets of data. Usually fine but each set of data is separated with a empty line and then the headers for the next set of data (the headers are always the same). Only problem is that there is no identifying column to let me just remove these rows. So for each set of data I need to remove the empty Row and Headers whilst giving each set a unique value in a new column. I hope this makes sense.

Honestly I've not got a clue handle this situation and I can't find any questions I could adapt. At least not at my current knowledge.

Any help will be greatly appreciated and might help me convince my boss to drop his mandraulic ways...

Example Image

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Hi! Sharing data as images isn't very helpful. Can you post some of the actual csv? Or some of the data.frame after using read.csv? Also see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Axeman Jun 04 '20 at 20:02
  • My suggestion would be to read the entire file (if it's not too large) using `readLines()`. Determine which lines are blank lines. You can then use this information in read.csv -- see the `nrows` and `skip` arguments. These tell you how many lines to read, and how many lines to skip before reading, respectively. Look at help(read.csv) for information on these arguments. So you would then read in separately as many datasets as you have, and at the end you can stack them, using `rbind`. You will need to be careful with some things along the way, but without more info, that's that best I can do – bzki Jun 04 '20 at 20:05
  • Does this answer help? https://stackoverflow.com/a/61091354/3358272? – r2evans Jun 04 '20 at 20:16
  • You could adapt this to find the empty column, too. Honestly, perhaps the easiest approach would be to open in Excel (or similar) and at *least* cut apart the side-by-side tables, the `readMultiCSV` function in my previous comment should be able to split apart the rest. – r2evans Jun 04 '20 at 20:22
  • @bzki I'm getting what you mean and i'll play around with this. Seems like this would be the most robust solution. Need to make it as user friendly as possible but hopefully could get this to work for all cases. I'll play today and try work it though like that. Thanks for the help and perspective! – Owen Llewellyn Jun 05 '20 at 01:33

1 Answers1

0

Edited because I missed the desired "set" column..

If your data is as you say it is from the picture you might be able to just... It worked for me on a small sample I built in excel made up to be a bit like yours. I'm assuming "Current" and "Desired" are not actually in the file and the first row starts with the header

library(dplyr)

yourdata <- read.csv("yourcsv.csv")

yourdata %>% 
  mutate(Set = cumsum(!duplicated(Location == "Location"))) %>% 
  filter(Location != "Location", Location != "") 
#>   Location Feature Height Volume Set
#> 1     U104       3    104     99   1
#> 2     U104       5    123    456   1
#> 3     U104       6     45    999   2
#> 4     U105       6     45    999   2

Based on data

structure(list(Location = c("U104", "U104", "", "Location", "U104", 
"U105"), Feature = c("3", "5", "", "Feature", "6", "6"), Height = c("104", 
"123", "", "Height", "45", "45"), Volume = c("99", "456", "", 
"Volume", "999", "999")), class = "data.frame", row.names = c(NA, 
-6L))
Chuck P
  • 3,862
  • 3
  • 9
  • 20
  • 1
    Cheers! I'll try make a better question next time but glad you could understand it at least. The only one issue is the Location isn't likely to be the same in the fills but I do think I can get this working for 90% of the CSV's and with the nature of the data we work with (Solder Paste Inspection) pretty sure I know another really helpful aspect with identifying Array on a PCB. – Owen Llewellyn Jun 05 '20 at 01:44
  • Glad this was helpful come back with more questions and better example data if you need to. `dput` or even just a snippet of your csv file are very helpful – Chuck P Jun 05 '20 at 11:51