I'm learning R in my Data-Driven Business course at my university, so it's brand new to me. I'm making a data project that analyzes data from a .csv file.

My problem is removing rows based on values in the column "Year_Birth". Here is what I have tried so far; neither attempt gives me the result I want:

# Read a csv file using read.csv()
csv_file = read.csv(file = "filtered_data.csv", 
                    stringsAsFactors = FALSE, header = TRUE)

BabyBoomer = csv_file$Year_Birth[ csv_file$Year_Birth >= 1946 & csv_file$Year_Birth <= 1964]
head(BabyBoomer)

print::
[1] 1957 1954 1959 1952 1946 1946

y = csv_file$Year_Birth[csv_file$Year_Birth <= 1964]
BabyBoomer <- csv_file[-c(y), ]
head(BabyBoomer)

print:: the data frame, but with nothing changed

I would like to create a subset that keeps only the rows where Year_Birth <= 1964 and drops all the others.

AnxiousDino
    Subset the whole data frame with `df[rows, ]`, leaving the column index empty so that no columns are filtered: `filtered_df <- csv_file[csv_file$Year_Birth <= 1964, ]`. The docs are a little hard to find, but you can get to them with `?\`[.data.frame\`` – alistaire Nov 02 '22 at 19:22

2 Answers

y = csv_file$Year_Birth[csv_file$Year_Birth <= 1964]

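After executing the snippet above, y contains the Year_Birth values that are <= 1964, i.e. the years themselves rather than row positions. Using it as csv_file[-c(y), ] asks R to drop row numbers such as 1957 or 1946; if the data frame has fewer rows than that, those indices are ignored and nothing appears to change. What you need in order to extract the subset is a vector containing the row indices of the data.frame where Year_Birth <= 1964. This code will do that: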

y <- which(csv_file$Year_Birth <= 1964)
BabyBoomer <- csv_file[ y, ]
head(BabyBoomer)
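
To make the difference concrete, here is a minimal sketch on a made-up data frame. The name toy_df and its values are assumptions for illustration only, not the asker's data:

```r
# Toy data standing in for csv_file (values are made up for illustration)
toy_df <- data.frame(
  ID = 1:6,
  Year_Birth = c(1957, 1984, 1946, 1990, 1964, 1971)
)

# The original attempt: y_years holds years, not row positions
y_years <- toy_df$Year_Birth[toy_df$Year_Birth <= 1964]
unchanged <- toy_df[-c(y_years), ]  # tries to drop rows 1957, 1946, 1964
nrow(unchanged) == nrow(toy_df)     # TRUE here: those row numbers don't exist

# which() returns the row positions where the condition holds
idx <- which(toy_df$Year_Birth <= 1964)
boomers <- toy_df[idx, ]

# A logical vector can be used as the row index directly
boomers_lgl <- toy_df[toy_df$Year_Birth <= 1964, ]
```

One practical difference between the two forms: if Year_Birth contains NA, the logical form returns all-NA rows for those entries, while which() skips them.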
br00t
    Thanks! Much appreciated :) I need to look up the difference between <- and = in R. I will also look into the "which" method. Thanks again. :) – AnxiousDino Nov 02 '22 at 19:24
  • 2
    Note that `<-` is the assignment operator and basically equivalent to `=` , but there are some subtle differences you should be aware of. Please see https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators for further info – br00t Nov 02 '22 at 19:27
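
As a small illustration of one such difference (a sketch, not a full account of the linked discussion): inside a function call, `=` matches a named argument, while `<-` performs an assignment in the calling environment and then passes the value on.

```r
# `=` inside a call matches the argument named x; it does not create a variable
m1 <- median(x = c(3, 1, 2))

# `<-` inside a call assigns x in the calling environment, then passes it on
m2 <- median(x <- c(3, 1, 2))
exists("x", inherits = FALSE)  # x now exists in the current environment
```

At the top level of a script the two behave the same for plain assignments, which is why the code in the question works with `=`.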

Try using the subset() function. With that you can say y <- subset(dataset, dataset$year <= 1964), which keeps only the rows where the condition is true.

EDIT: if you only want a vector containing the years, you can say subset(dataset$year, dataset$year <= 1964)
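
A minimal sketch of both forms, using a made-up data frame (toy_df and its values are assumptions for illustration) and the cutoff of 1964 from the question:

```r
# Toy data standing in for the real data set (values are made up)
toy_df <- data.frame(
  ID = 1:6,
  Year_Birth = c(1957, 1984, 1946, 1990, 1964, 1971)
)

# Data frame form: columns can be named directly inside subset()
older <- subset(toy_df, Year_Birth <= 1964)

# Optionally keep only some columns with the select argument
older_ids <- subset(toy_df, Year_Birth <= 1964, select = ID)

# Vector form: returns just the matching years
older_years <- subset(toy_df$Year_Birth, toy_df$Year_Birth <= 1964)
```

Note that the help page for subset() describes it as a convenience function intended for interactive use, so inside functions the bracket form csv_file[condition, ] may be the safer choice.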

Check out this documentation; it helped me a lot to get started: https://homerhanumat.github.io/elemStats/

Hope this helps!

  • 2
    Thanks for replying, this method works to separate the rows as I requested, but when I input y into hist() to make a histogram, it prints:: Error in hist.default(y$Year_Birth, 50) : character(0) In addition: Warning messages: 1: In min(x) : no non-missing arguments to min; returning Inf 2: In max(x) : no non-missing arguments to max; returning -Inf – AnxiousDino Nov 02 '22 at 19:44
  • 2
    Which one are you using right now? The subset(dataset, ...) or the subset(dataset$year, ...) one? – Iwan de Jong Nov 02 '22 at 19:49
  • 2
    y = subset(csv_file, csv_file$Year_Birth < 1946) does not work with hist(y, 50) – AnxiousDino Nov 02 '22 at 19:51
  • 2
    Since you're creating a histogram for the Year_Birth variable, you need to say hist(y$Year_Birth, 50). In terms of y, the subset is still a whole data set, so you need to reference ($) the Year_Birth. I hope this solves your problem – Iwan de Jong Nov 02 '22 at 20:24
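
The points from this exchange can be sketched on made-up data (toy_df and its values are assumptions for illustration). The "no non-missing arguments to min" warnings quoted above are what you would expect when the vector passed to hist() is empty, which might happen if the condition matches no rows, so the sketch checks for that first:

```r
# Toy data standing in for csv_file (values are made up)
toy_df <- data.frame(
  ID = 1:6,
  Year_Birth = c(1957, 1984, 1946, 1990, 1964, 1971)
)

y <- subset(toy_df, Year_Birth <= 1964)
is.data.frame(y)          # y is still a data frame, so hist(y) would fail
is.numeric(y$Year_Birth)  # the column itself is what hist() needs

# A condition that matches nothing yields zero rows
empty <- subset(toy_df, Year_Birth < 1946)
nrow(empty)

# Only plot when there is something to plot
if (nrow(y) > 0) {
  h <- hist(y$Year_Birth, breaks = 5,
            main = "Year of birth", xlab = "Year_Birth")
}
```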