
This seems like it should be fairly straightforward, but I can't quite figure out how to make this code work; better knowledge of regex would probably help me out.

I have a list of URLs, many of which are from domains that belong to countries outside the US. I would like to filter out the ones that match a list of specific country codes. My list is based on a table found here: https://www.countries-ofthe-world.com/TLD-list.html

Taking my original list of URLs, I separated them out so that one column contains just the top-level domain ending (.com, .net, etc.).

I then want R to go through my list, detect all of the country endings that I took from that table, and filter those out. However, it doesn't seem to work the way I had hoped:

filtered_list <- df %>% filter(!str_detect(domain_ending, country$endings))

The idea is that it will take all the domain endings and keep only the ones that don't match my list. I've tested a bunch of variations of this code, but I can't figure out why it removes some .coms and other endings that aren't even in my list, while keeping .de and others that I know should be filtered out.
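For what it's worth, two properties of str_detect() can produce exactly these symptoms. A minimal sketch (with made-up vectors, not the real data) illustrating both:

library(stringr)

# str_detect() is vectorized over BOTH arguments: a vector of patterns is
# paired with the strings element-wise (recycling the shorter one), so each
# ending is only tested against ONE country code rather than all of them
str_detect(c(".com", ".de", ".at"), c(".at", ".com", ".de"))
## [1] FALSE FALSE FALSE   # ".de" and ".at" slip through

# an unescaped "." matches ANY character, so Oman's ".om" matches ".com"
str_detect(".com", ".om")
## [1] TRUE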

Edit: Here are some fictional variations on example websites to help with the code:

listing <- c("Facebook.com", "Twitter.de", "Google.at", "Youtube.cn", "Instagram.fi", "Linkedin.com", "Wordpress.org", "Pinterest.au", "Wikipedia.org")

Supposing I wanted to take that list and filter out all the endings that appear in the table linked above, how would I go about it? There's something wrong with my code somewhere, so maybe this example can help. My variables are of class character; could that make a difference?
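Assuming the endings live in a character vector, a sketch of one way to do the whole filter with stringr (the endings vector below is a hypothetical subset of the linked table): escape the dots so they are literal, anchor the pattern at the end of the string, and collapse everything into one regex:

library(stringr)

listing <- c("Facebook.com", "Twitter.de", "Google.at", "Youtube.cn",
             "Instagram.fi", "Linkedin.com", "Wordpress.org",
             "Pinterest.au", "Wikipedia.org")
endings <- c(".fi", ".au", ".uk", ".at", ".de", ".cn")

# "\\.fi" is a literal ".fi"; "$" anchors the match at the end of the URL
pattern <- str_c("(", str_c("\\", endings, collapse = "|"), ")$")
listing[!str_detect(listing, pattern)]
## [1] "Facebook.com"  "Linkedin.com"  "Wordpress.org" "Wikipedia.org"

The character class shouldn't matter here; str_detect() works on any character vector.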

Edit2: Wrote a CSV file and re-uploaded it into R and now it works. Sorry to waste everyone's time. Thanks for everyone's help though.

    Can you post a sample of the data? This is almost certainly a problem someone can help with if you post a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – camille Aug 07 '18 at 15:35
  • I can't post a sample of the URLs since they're for work. But any list of URLs will do for the purposes of this code. Something like this https://moz.com/top500 – Michael Smith Aug 07 '18 at 16:56

2 Answers


You can use which() together with %in% to filter out these URLs:

filtered_list <- df[which(!df$domain_ending %in% country$endings), ]
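Note that %in% does exact matching rather than regex matching, so this works when the domain_ending column holds just the bare endings. A quick sanity check with made-up stand-ins for df and country:

df <- data.frame(domain_ending = c(".com", ".de", ".at", ".org"))
country <- data.frame(endings = c(".fi", ".au", ".uk", ".at", ".de", ".cn"))

# keep only the rows whose ending is NOT in the country list
df[which(!df$domain_ending %in% country$endings), , drop = FALSE]
##   domain_ending
## 1          .com
## 4          .org

The which() is optional; the logical vector alone (df[!df$domain_ending %in% country$endings, ]) subsets the same rows.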
Arno
  • This code hasn't worked either. I don't understand why, because it seems like it should. I am adding some examples so we can solve this: Websites: 'listing <- c("Facebook.com", "Twitter.de", "Google.at", "Youtube.cn", "Instagram.fi", "Linkedin.com", "Wordpress.org", "Pinterest.au", "Wikipedia.org")' Endings to avoid: 'endings <- c(".fi", ".au", ".uk", ".at", ".de", ".cn")' – Michael Smith Aug 07 '18 at 17:39

One way to solve this is to collapse the endings into a single pattern with the regex pipe (|):

library(stringr)

listing <- c("Facebook.com", "Twitter.de", "Google.at", "Youtube.cn",
             "Instagram.fi", "Linkedin.com", "Wordpress.org",
             "Pinterest.au", "Wikipedia.org")

endings <- c(".fi", ".au", ".uk", ".at", ".de", ".cn")

# join the endings into one alternation pattern
pattern <- str_c(endings, collapse = '|')
grep(pattern, listing, value = TRUE)

## > pattern
## [1] ".fi|.au|.uk|.at|.de|.cn"

## > grep(pattern, listing, value = TRUE)
## [1] "Twitter.de"   "Google.at"    "Youtube.cn"   "Instagram.fi" "Pinterest.au"
ryanhnkim