0

I have a data frame which was created by importing several .csv files and subsequently merging them all together.

Each of the data frames that I read in all have the column headings on row 8, with some descriptive text in the first seven rows.

This is why the duplicate rows have occurred - because I cannot use the values in row 8 from the first data frame and then discard the first 8 rows from the rest of the data frames (or perhaps I can - I'm sure it's possible).

Ultimately, what I want to happen is this:

- Read first .csv into data frame.
- Take values of row 8 to be column names
- Delete the first 8 rows.
- Read all other .csv files in, remove the first 8 rows from each one, and merge them all into the same data frame.

I am now faced with a problem where some of the rows will contain the same values as their corresponding column names.

For example, the merged data frame now looks something like this:

--------------------------
| Name | Age | MonthBorn |
-------------------------
| Bob  | 23  | September |
| Steve| 45  | June      |
| Name | Age | MonthBorn | # Should be removed
| Sue  | 74  | January   |
| Name | Age | MonthBorn | # Should be removed
| Tracy| 31  | February  |
--------------------------

The trouble is that the combined data frame is almost 340,000 rows deep so I can't go through manually and check everything by hand. Also, I have a rough idea where each row might appear, but I can't be certain as there is a chance of variation.

How can I either check to see if the value of a row/cell matches the corresponding column name or set up the import process as outlined (bulleted) above?

M--
  • 25,431
  • 8
  • 61
  • 93
Mus
  • 7,290
  • 24
  • 86
  • 130
  • 2
    How did you wind up with such a merged data.frame? Seems like those headers were improperly imported in the first place. It's better to give a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to show what's really going on. Probably better to prevent the problem in the first place than fix it down stream. – MrFlick Aug 29 '17 at 14:41
  • @d.b I did set `header = TRUE`. Each file is a snippet of a larger whole, each with it's own set of (identical) headers, and which was extracted from an online database (the export process automatically limits .csv sizes). – Mus Aug 29 '17 at 16:00

3 Answers3

1

Your data

df <- structure(list(Name_ = c("Bob", "Steve", "Bob", "Name", "Sue", 
"Name", "Tracy"), `_Age_` = c("23", "45", "23", "Age", "74", 
"Age", "31"), `_MonthBorn` = c("September", "June", "September", 
"MonthBorn", "January", "MonthBorn", "February")), .Names = c("Name_", 
"_Age_", "_MonthBorn"), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))

solution

library(stringr)
df[!sapply(1:nrow(df), function(x) all(mapply(function(x,y) str_detect(x,y), colnames(df), df[x,]))),]

Output

   Name_ _Age_ _MonthBorn
1:   Bob    23  September
2: Steve    45       June
3:   Bob    23  September
4:   Sue    74    January
5: Tracy    31   February
CPak
  • 13,260
  • 3
  • 30
  • 48
  • Thanks for pointing that out. I believe OP will want to filter this out as well, but OP can correct me if I'm mistaken – CPak Aug 29 '17 at 14:59
  • I've edited my answer to handle *rows will contain the same values as their corresponding column names* – CPak Aug 29 '17 at 16:06
1

We can use functions from dplyr and tidyr to combine the content of all columns together. After that, filter out those that are the same as the combine column names. dt2 is the final output.

# Create example data
dt <- read.table(text = "Name Age MonthBorn
Bob 23 September
Steve 45 June 
Bob 23 September
Name Age MonthBorn
Sue 74 January
Name Age MonthBorn
Tracy 31 February",
                 header = TRUE, stringsAsFactors = FALSE)

# Load package
library(dplyr)
library(tidyr)

# Process the data
dt2 <- dt %>%
  unite(ColName, everything(), sep = ", ", remove = FALSE) %>%
  filter(ColName != toString(colnames(dt))) %>%
  select(-ColName)

dt2
   Name Age MonthBorn
1   Bob  23 September
2 Steve  45      June
3   Bob  23 September
4   Sue  74   January
5 Tracy  31  February
www
  • 38,575
  • 12
  • 48
  • 84
0

If your data frame looks approximately as follows:

Df <- Data.frame(Name, Age, MonthBorn)

Then you could use an ifelse statement to test if "MonthBorn" shows up in a row.

Df$MonthBornTest <- ifelse(Df$MonthBorn == “MonthBorn”, “True”, “False”}

Then you should be able to do this to remove the rows that contain True, effectively dropping the rows you don't want anymore.

Df <- Df[!(Df$MonthBornTest == “True”), ]
gmaly
  • 91
  • 4