My question has two parts:

I am dealing with a dataset that looks at forest product usage across many countries. Each row represents a household from one of these countries (about 30 total). Each country has a 4-digit code, but the dataset has no column for country code. The only way to tell which country a household belongs to is the household ID ("ghousecode"), a 7-digit code whose first 4 digits are the country code. For example, if Bolivia's country code were 3024, then a household in Bolivia could be 3024105 or 3024999...

I want code that selects all the entries for a specific country. I am using the tidyverse, so I thought of using select() and num_range(), but it hasn't worked. I don't get an error message, but the output still contains households from every country. Here is my current code:

    #forest_use_tibble is a tibble with observations on forest usage from many countries
    #I selected a subset of the original file's variables. 

    forest_use_simpler <- select(forest_use_tibble, ghousecode, year, product, income, amount, unit)

    #take Bolivia, whose country ID is 3024. This means that each ghousecode that begins with 
    #3024 is from Bolivia. 
    #but each ghousecode is 3024xxx with three other numbers after it.

    x = 3024
    Bolivia <- select(forest_use_simpler, num_range("x", 001:999), everything())

    #my goal: a new tibble/dataframe that has only the entries from Bolivia
    #there is no separate column for country ID, unfortunately.

Any ideas?

Second part of the question: Is there a way to apply num_range to just one of the columns (i.e. variables, in this case ghousecode)? As written above, it seems to me that it would search all variables in forest_use_simpler, so it might include another country's household if the digits 3024 appeared somewhere other than ghousecode.

Thank you!

(Note: I have also tried putting in 3024 directly where x is, to no avail. Thanks again for all help.)

  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. We can't run your code without any data, and we can't see what output you're getting. It also helps (both us and you) to pare the question down to just what's needed to identify the problem – camille Dec 31 '19 at 03:47
  • Sorry about that camille! Thanks for the feedback. Am new to this and do appreciate it. I have tried to improve for my next question. Happy new year! – julianabanana Dec 31 '19 at 05:19
  • You can [edit] the question to include the necessary information – camille Dec 31 '19 at 14:00

1 Answer

If the ghousecode is consistently formatted with 7 digits, how about something like this?

    library(tidyverse)

    df <-
      tibble(
        ghousecode = c(2039434, 3024105),
        year = c(2019, 2019)
      )

    df %>% 
      mutate(country_code = floor(ghousecode / 1000)) %>% 
      filter(country_code == 3024)

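select() chooses columns, while filter() chooses rows. num_range("x", 1:999) matches columns whose names are x1 through x999; it never looks at the values stored in ghousecode, and it matches nothing here without raising an error, which is why the original code ran but did not subset the households.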
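
To make the column-versus-row distinction concrete, here is a minimal sketch on made-up data. The tibble `toy` and its columns `x1` to `x3` are hypothetical and exist only to show what num_range() matches; the country code 3024 is the example value from the question.

```r
library(dplyr)

# Hypothetical toy data; x1..x3 are invented column names for illustration.
toy <- tibble(
  ghousecode = c(2039434, 3024105, 3024999),
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  x3 = c(7, 8, 9)
)

# num_range("x", 1:2) matches the column NAMES x1 and x2.
# It does not inspect any values, so it cannot pick out households.
toy_cols <- select(toy, ghousecode, num_range("x", 1:2))

# filter() keeps rows. Integer division by 1000 drops the last three
# digits of a 7-digit ghousecode, leaving the 4-digit country code.
country <- 3024
bolivia <- filter(toy, ghousecode %/% 1000 == country)
```

Because the condition names ghousecode explicitly, only that column is checked, so a 3024 appearing in some other variable cannot pull in a household from another country. This relies on every ghousecode having exactly 7 digits, as described in the question.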

cardinal40
  • Thank you! This helped, seems like an easy fix to my issue now that you point it out--- still wondering about the use of num_range when you have a same-beginning 4 digits, but you solved my problem at hand and for that I thank you very much! :) – julianabanana Dec 31 '19 at 05:21