
Since I started using R not long ago, I've found this site very useful in helping me build my scripts. I have yet again come across a challenge for which I can't seem to find an answer anywhere. Here is my problem: in my data I have a column which contains a different URL in each row. In each of those URLs there is a particular piece of information I want to extract. Currently I do it in Excel, because I've been told it's impossible to do in R and that no function exists for it.

The URLs look like the examples below and are found in the "Source" column:

http://www.googleclick.com?utm_source=ADX&ID56789

http://www.googleclick.com?utm_source=ADW&ID56009

http://www.googleclick.com?utm_source=ADWords&ID53389

The part of the URL that is of importance to me is the "utm_source=ADX" bit. My data looks something like this:

User / Source

1 / http://www.googleclick.com?utm_source=ADX&ID56789

2 / http://www.googleclick.com?utm_source=ADW&ID56009

3 / http://www.googleclick.com?utm_source=ADWords&ID53389

What I need to do is to capture the utm_source from the URL and transpose the information into a different column, example below:

User / Source / utm_source

1 / googleclick / ADX&ID56789

2 / googleclick / ADW&ID56009

3 / googleclick / ADWords&ID53389

So in essence I need R to search the entire data frame for the value "utm_source=", turn "utm_source" into a column name, and copy everything that comes after the "=" into that column for each individual row. I know that grep locates a specific piece of information in the data frame, for example data <- total[grepl("utm_source", total$Source), ]. This gives me all the rows that contain "utm_source", but what I need is the information that comes after "utm_source=". My data can have as many as 500,000 rows. At the moment I use the Excel function "text to columns" for this, and I basically split the URLs into little bits and keep the columns that I need, but this can be a very messy and lengthy process.

Is there a way to modify the grepl function to meet the criteria I need?

David Rogers

2 Answers


Nothing is impossible.

x <- read.csv(text="
User, Source
1, http://www.googleclick.com?utm_source=ADX&ID56789
2, http://www.googleclick.com?utm_source=ADW&ID56009
3, http://www.googleclick.com?utm_source=ADWords&ID53389
", header=TRUE, stringsAsFactors=FALSE)

First, use strsplit:

strsplit(x$Source, split="\\?utm_source=")
[[1]]
[1] " http://www.googleclick.com" "ADX&ID56789"                

[[2]]
[1] " http://www.googleclick.com" "ADW&ID56009"                

[[3]]
[1] " http://www.googleclick.com" "ADWords&ID53389" 
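To turn the split into an actual column (a minimal sketch building on the code above; sapply with the `[` accessor pulls out the second element of each split):

```r
x <- read.csv(text="
User, Source
1, http://www.googleclick.com?utm_source=ADX&ID56789
2, http://www.googleclick.com?utm_source=ADW&ID56009
3, http://www.googleclick.com?utm_source=ADWords&ID53389
", header=TRUE, stringsAsFactors=FALSE)

# Keep the part after "utm_source=" as a new column
x$utm_source <- sapply(strsplit(x$Source, split="\\?utm_source="), `[`, 2)
```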

Then find a red-hot poker and stick it in the eye of your so-called advisor.


EDIT:

As suggested by Paul Hiemstra, you can also use a regular expression directly:

gsub(".*\\?utm_source=", "", x$Source)
[1] "ADX&ID56789"     "ADW&ID56009"     "ADWords&ID53389"
Andrie
  • +1, however the regex based solution might be even more robust. – Paul Hiemstra Nov 19 '12 at 15:20
  • Funny that someone would say that something is not possible in R, but is in Excel. I only know of examples of the other way around. – Paul Hiemstra Nov 19 '12 at 15:22
  • Thanks so much for the prompt answer. You guys really know your stuff. The strsplit does not seem to work for me: Error in strsplit(file1$Source, split = "\\?utm_source=") : non-character argument. No idea what the error means. The gsub seems to work, but it creates my new subset as values so I can't write it as csv. Do I need to add something like as.character or something? – David Rogers Nov 19 '12 at 15:40
  • @DavidRogers Your first error is caused by the fact that your data is a factor, not a character. Use `stringsAsFactors=FALSE` when you `read.csv()` as I show in my code. In both cases, you need to assign the result of the split operation to a column in your data frame. – Andrie Nov 19 '12 at 15:43
  • file1 <- read.csv("C:/Users/David Rogers/Desktop/Users.csv") So this is my file. How do I read it as character? I've never read my files this way before, so not sure where to put the extension. – David Rogers Nov 19 '12 at 15:48
  • Sorry, I am a beginner with R – David Rogers Nov 19 '12 at 15:49
  • Study the code example I posted. Look out for the bit `stringsAsFactors=FALSE`. – Andrie Nov 19 '12 at 15:50
  • Got it. file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv", stringsAsFactors=FALSE) Thanks – David Rogers Nov 19 '12 at 15:56
  • However, what if I wanted to copy not everything that is after "utm-source="? What if from http://www.googleclick.com?utm_source=ADX&ID56789 I only wanted what is after= and before & so in other words just the "ADW" How would I need to modify the formula? – David Rogers Nov 19 '12 at 16:05
  • `gsub(".*\\?", "", x$Source)` – Andrie Nov 19 '12 at 16:10
  • Sorry guys, I just can't seem to be able to follow up on either suggestions. They just don't seem to work in my case. It's just my low understanding of R programming. But thank you anyway. – David Rogers Nov 19 '12 at 16:17
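For the follow-up question in the comments (keeping only the token between "=" and "&"), a capture group with sub does it. This is a sketch, not from the original answers:

```r
urls <- c("http://www.googleclick.com?utm_source=ADX&ID56789",
          "http://www.googleclick.com?utm_source=ADW&ID56009",
          "http://www.googleclick.com?utm_source=ADWords&ID53389")

# "\\1" keeps only the capture group: everything after "utm_source="
# up to (but not including) the first "&"
sub(".*\\?utm_source=([^&]*).*", "\\1", urls)
# [1] "ADX"     "ADW"     "ADWords"
```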

@Andrie's answer does the trick. Here's another way, using regmatches and gregexpr, that might be useful.

d <- read.table(text="User URL
1 http://www.googleclick.com?utm_source=ADX&ID56789
2 http://www.googleclick.com?utm_source=ADW&ID56009
3 http://www.googleclick.com?utm_source=ADWords&ID53389", header=TRUE)

domain.pat <- '((?<=www.)([[:alnum:]_]+))'
source.pat <- '((?<=utm_source=)([[:alnum:]&]+))' # drop the '&' from this character class to capture only up to the '&'
all.matches <- gregexpr(paste(domain.pat, source.pat, sep='|'), d$URL, perl=TRUE)
all.substrings <- regmatches(d$URL, all.matches)
do.call(rbind, all.substrings)

#      [,1]          [,2]             
# [1,] "googleclick" "ADX&ID56789"    
# [2,] "googleclick" "ADW&ID56009"    
# [3,] "googleclick" "ADWords&ID53389"
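To get from that matrix to a CSV (a sketch addressing the export question in the comments; the output file name is made up), bind the columns back into a data frame and use write.csv:

```r
d <- read.table(text="User URL
1 http://www.googleclick.com?utm_source=ADX&ID56789
2 http://www.googleclick.com?utm_source=ADW&ID56009
3 http://www.googleclick.com?utm_source=ADWords&ID53389",
header=TRUE, stringsAsFactors=FALSE)

domain.pat <- '((?<=www.)([[:alnum:]_]+))'
source.pat <- '((?<=utm_source=)([[:alnum:]&]+))'
all.matches <- gregexpr(paste(domain.pat, source.pat, sep='|'), d$URL, perl=TRUE)

# Each list element holds the two matches for one URL; rbind stacks them
m <- do.call(rbind, regmatches(d$URL, all.matches))

out <- data.frame(User=d$User, Source=m[, 1], utm_source=m[, 2],
                  stringsAsFactors=FALSE)
# write.csv(out, "utm_sources.csv", row.names=FALSE)  # hypothetical file name
```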
Matthew Plourde
  • Hi mplourde. Could you please explain how it is that Andrie's suggestion worked when you tried it? In my case it did not. Yours works up until the point of do.call(rbind, all.substrings). When I try to subset this (because I need to export it as csv) it gives me a character matrix. – David Rogers Nov 19 '12 at 16:31
  • So, just to be clear, my desired output is a subset which contains the following columns: Col1 - userID, Col2 - utm_source (and inside each row of this column there should be whatever was initially present inside the URL between = and & (ADW or ADX or ADWords)). I hope that makes sense :) – David Rogers Nov 19 '12 at 16:32
  • @David Apparently in your `data.frame` the URLs are of `type` factor. They need to be `character` for this to work. You can either follow Andrie's advice of using `stringsAsFactors=FALSE` when reading the data, or replace `d$URL` with `as.character(d$URL)`, in the example I provided. – Matthew Plourde Nov 19 '12 at 16:37
  • Can you show me how your version should look? Basically, what is the function to transform a data.frame column into a character? If I have a file named "Users" how do I transform it? I know that this is probably incredibly basic for you, but for someone who knows very little of R, everything is a challenge. :) – David Rogers Nov 19 '12 at 16:50
  • I tried this as.character(Users$URL) but it just shows me 1000 rows of data and says [ reached getOption("max.print") -- omitted 191016 entries ] – David Rogers Nov 19 '12 at 16:51
  • @David, to convert the URL column to `character`, try `Users <- transform(Users, URL = as.character(URL))`. – Matthew Plourde Nov 19 '12 at 16:53
  • thank you so much mplourde. I will try again with the approach you suggested. Thanks again. All the best" David – David Rogers Nov 19 '12 at 17:04