0

In my data huge .csv dataframe, I have merged 100+ csvs through cmd. This includes the headers. Now, I wish to delete the following duplicated header from my master csv in R:

Year|RecID|ParID|ConParID|Country|Division|RegCnty|RegDist|SubDist|RC|RD|RSD|Parish|Area|Part|Population|MalePop|FemalePop|NoOfInstit|InstitPop|ParType|Censusref|ImageRef|PageType|DocType|EnuDist|BuildType|BTCode|NoOfRooms|NoOfRoomsCode|Schedule|H|Absent|Absentcode|HSS|InstName|InstDesc|VessName|VessPos|PID|Sex|SexInf|Age|Cage|AgeInf|Cond|Mar|MarInf|Relat|Rela|RelInf|HeadInf|Occ|HollerOcc|Occode|HISCO|Industry|HollerInd|Employ|EmployCode|AtHome|Inactive|Disab|DisCode1|DisCode2|Bpstring|BpCmty|Std_Par|BpCnty|Cnti|Alt_Cnti|BpCtry|Ctry|Alt_Ctry|HollerB|Nationality|Lang|Langcode|YearsMar|MarYear|ChildTot|ChildAlive|ChildDead|ChildrenCode|HHD|H_Sex|H_Age|H_Rela|H_Mar|H_Occ|H_CFU|SameName|CFU|n_CFUs|tn_CFUs|CFUsize|Spouse|Father|Mother|f_Off|m_Off|m_Offm|f_Offm|Offsp|Kids|Relats|Inmates|Servts|Non_Rels|Visitors|Military

This header appears as many times as there were initial csv files, and not at a regular interval. How can I select all rows containing this header to include it in the following code:

myData <- myData[-c(...)]

Any help appreciated or other alternative solutions. It's big data, so I cannot open and remove duplicates in excel.

Rye Rye
  • 57
  • 5
  • 1
    Better to not read them in in the first place. `read.csv` has the `header` argument, for example. Details in the online doc. – Limey Feb 25 '21 at 10:03
  • I have hundreds of files, with names such as 8721dnis843284, this would take an incredible amount of time and diligence to compile.... I'll look into it though. But I think the approach I have taken has the easiest solution... – Rye Rye Feb 25 '21 at 10:07
  • 2
    Do you still have those 100+ csv? Instead of merging them using cmd, do it all in R, see this post: https://stackoverflow.com/q/11433432/680068 – zx8754 Feb 25 '21 at 10:08
  • 2
    Get the file names in a list. `list.files`? Then `bind_rows(lapply, fileList, read.csv, header=TRUE)` or similar. – Limey Feb 25 '21 at 10:10

1 Answers1

1

Instead of merging them in cmd it is advised to do so in R, as merging all data (UNION) along with header as rows, will result in changing the column type to strings everywhere and you'll have to do a lot of work to change their types all over again. See this answer for complete help as to how merge these in R itself.

If still you have a merged data that you don't want to repeat the steps all over again, you can remove the header rows in R by this command.

Obviously Year column won't have value Year except in header rows so do this

myData <- myData[myData$Year != 'Year',]

myData$Year != 'Year' will have True for only meaningful rows and replace myData with subset of these meaninful i.e. (non-header) rows only.

If you have Year column values equal to 'Year' anywhere use this logic on some other column

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • 1
    This would work, but then you end up with a dataframe with all columns class of character. Better read all in a loop and let R convert to suitable classes, then rowbind. – zx8754 Feb 25 '21 at 10:11
  • But, if OP has already an UNION of data in such a way that header rows are repeated across the data and still don't want to do it all over again, then it is sure that all columns are of character type only – AnilGoyal Feb 25 '21 at 10:14
  • 1
    Exactly, since all in character, with this answer we are creating another problem to solve: convert all noncharacter to specific classes. Better avoid that step, and let R do it automagically when importing each file separately, then simply rowbind. – zx8754 Feb 25 '21 at 10:21
  • agree it with fully – AnilGoyal Feb 25 '21 at 10:24
  • Thank you guys. Can someone sum up the approach I should be taking? – Rye Rye Feb 25 '21 at 10:35
  • 1
    @RyeRye Please see the linked posts. – zx8754 Feb 25 '21 at 10:42