I have a data frame which was created by importing several .csv
files and subsequently merging them all together.
Each of the data frames that I read in all have the column headings on row 8, with some descriptive text in the first seven rows.
This is why the duplicate rows have occurred - because I cannot use the values in row 8 from the first data frame and then discard the first 8 rows from the rest of the data frames (or perhaps I can - I'm sure it's possible).
Ultimately, what I want to happen is this:
- Read first .csv into data frame.
- Take values of row 8 to be column names
- Delete the first 8 rows.
- Read all other .csv files in, remove the first 8 rows from each one, and merge them all into the same data frame.
I am now faced with a problem where some of the rows will contain the same values as their corresponding column names.
For example, the merged data frame now looks something like this:
--------------------------
| Name | Age | MonthBorn |
-------------------------
| Bob | 23 | September |
| Steve| 45 | June |
| Name | Age | MonthBorn | # Should be removed
| Sue | 74 | January |
| Name | Age | MonthBorn | # Should be removed
| Tracy| 31 | February |
--------------------------
The trouble is that the combined data frame is almost 340,000 rows deep so I can't go through manually and check everything by hand. Also, I have a rough idea where each row might appear, but I can't be certain as there is a chance of variation.
How can I either check to see if the value of a row/cell matches the corresponding column name or set up the import process as outlined (bulleted) above?