This is my data set, stored in a CSV file:

Weight                  Count
Less than 500 grams     5,980
500 to 999 grams        22,015
1,000 to 1,499 grams    29,846
1,500 to 1,999 grams    63,427
2,000 to 2,499 grams    204,295
2,500 to 2,999 grams    744,181
3,000 to 3,499 grams    1,566,755
3,500 to 3,999 grams    1,055,004
4,000 to 4,499 grams    262,997
4,500 to 4,999 grams    36,706
5,000 to 5,499 grams    4,216

I'm trying to make a histogram of frequency (%), but I've run into two problems:

1) $Weight is treated as a factor instead of numeric because its values are text labels rather than numbers. I can't find a way to make the ranges numeric in R so that ggplot will accept them. What would be the best way to go about this?

EDIT: I tried cut() to set intervals, but because the original data is alphanumeric, it doesn't work. I get this error:

newborn$Weight_cut<- cut(newborn$Weight, 
                       breaks = c(0, 500, 1000,1500,2000,2500,3000,3500,4000,4500,5000), 
                       labels = c("<500","500-999",
                                  "1000-1499","1500-1999",
                                  "2000-2499","2500-2999",
                                  "3000-3499","3500-3999",
                                  "4000-4499","4500-4999",
                                  "5000-5499"), 
                       right = FALSE)
Error in cut.default(newborn$Weight, breaks = c(0, 500, 1000, 1500, 2000,  : 
  'x' must be numeric
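
One possible way around this error: cut() expects raw numeric measurements, whereas Weight already holds one label per bin, so there is nothing left to cut. A minimal sketch, assuming the labels look exactly like the ones in the table above (the names bin_labels, lower, and weight are made up for illustration), extracts a numeric lower bound from each label and uses it only to order the factor levels:

```r
# Bin labels as they appear in the CSV (typed in here so the sketch is self-contained).
bin_labels <- c("Less than 500 grams", "500 to 999 grams",
                "1,000 to 1,499 grams", "1,500 to 1,999 grams",
                "2,000 to 2,499 grams", "2,500 to 2,999 grams",
                "3,000 to 3,499 grams", "3,500 to 3,999 grams",
                "4,000 to 4,499 grams", "4,500 to 4,999 grams",
                "5,000 to 5,499 grams")

# Drop the thousands separators, then take the first run of digits in each label.
clean     <- gsub(",", "", bin_labels)
first_num <- as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", clean))

# Assumption: the open-ended "Less than ..." bin starts at 0.
lower <- ifelse(grepl("^Less than", clean), 0, first_num)

# Order the factor levels by lower bound instead of alphabetically.
weight <- factor(bin_labels, levels = bin_labels[order(lower)])
```

With the levels ordered this way, a plotting function that respects factor order should draw the bins from lightest to heaviest.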

2) Because I need to plot % frequency rather than count, I tried to add a new column with the % frequency. R reads $Count as a factor, so I converted it to numeric with the code below, BUT that changes all my count numbers and makes the frequencies completely wrong:

> class(newborn$Count)
[1] "factor"
> newborn$Count <- as.numeric(newborn$Count)
> class(newborn$Count)
[1] "numeric"
> newborn$Percent = newborn$Count/sum(newborn$Count)*100
> newborn$Percent
 [1] 13.636364  6.060606  9.090909 15.151515  4.545455 16.666667  3.030303  1.515152  7.575758 10.606061 12.121212
> newborn
                  Weight Count   Percent
1    Less than 500 grams     9 13.636364
2       500 to 999 grams     4  6.060606
3   1,000 to 1,499 grams     6  9.090909
4   1,500 to 1,999 grams    10 15.151515
5   2,000 to 2,499 grams     3  4.545455
6   2,500 to 2,999 grams    11 16.666667
7   3,000 to 3,499 grams     2  3.030303
8   3,500 to 3,999 grams     1  1.515152
9   4,000 to 4,499 grams     5  7.575758
10  4,500 to 4,999 grams     7 10.606061
11  5,000 to 5,499 grams     8 12.121212
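
The values 9, 4, 6, ... in the output above look like the factor's internal level codes rather than the counts: as.numeric() applied to a factor returns the codes. A minimal sketch of one way around that, assuming Count was read as a factor because of the thousands separators (count_raw is a stand-in for newborn$Count):

```r
# Stand-in for newborn$Count as read from the CSV: a factor of comma-formatted strings.
count_raw <- factor(c("5,980", "22,015", "29,846", "63,427", "204,295",
                      "744,181", "1,566,755", "1,055,004", "262,997",
                      "36,706", "4,216"))

# as.numeric() on a factor gives the integer level codes, not the printed values.
codes <- as.numeric(count_raw)

# Convert to character first, strip the commas, then convert to numeric.
count <- as.numeric(gsub(",", "", as.character(count_raw)))

# Percent frequency now uses the real counts.
percent <- count / sum(count) * 100
```

Stripping the commas before the conversion is the key step; without it, as.numeric() on the character values would return NA with a warning.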
j681
  • You'll need to remove the commas from the `Count` column to read it as a number. See [here](https://stackoverflow.com/questions/1523126/how-to-read-data-when-some-numbers-contain-commas-as-thousand-separator) for some suggestions on how to do that. – aosmith Sep 15 '17 at 20:19
  • Perfect, thanks! I ran into a follow-up problem: I couldn't get "newborn$Percent = newborn$Count/sum(newborn$Count)*100" to work, but when I replaced "sum(....)" with the total that I calculated by hand, it worked for making the % column! – j681 Sep 15 '17 at 20:25
  • With this kind of data, you'll be creating a `barplot`, not a `histogram`, since you don't have the actual data. (For the latter, you have some control over bin widths, but in general `hist` and `geom_histogram` determine this automatically.) – r2evans Sep 15 '17 at 21:32
