1

I am using DB Browser for SQLite.

I have a table called 'df' and a column with a bunch of URL links called 'links'. All the links are from the same website and follow the same structured format:

https://'website name'/'language'/'type of content'/'content id'/'name of content'

Few examples of language: 1. en_au 2. fr_ca 3. en_us etc..

Also there is only one unique value in the 'type of content' portion of the link.

Goal: I need help extracting all the content ids from the link.

I have tried regex, but this pull all the information that matches the regex instead of the content id. I have also used the substr function (see code below), but this raises a few issues: 1. Some of the links don't begin with 'www.' 2. Not all the content ids are the same character length. 3. Not all of the language ids are the same character length.

SELECT
    substr(links, 36, instr(links, '/') +1) AS content_id
FROM df
beeeZeee
  • 51
  • 1
  • 2
  • 10
  • Sample data, what you have tried, desired output. – user14063792468 Jul 19 '19 at 16:50
  • Sorry if it is not clear in my post, but I have tried the code I have pasted above (substr). My desired output is to extract the 'content_id' from the url link sample I have posted above. I prefer a method that looks for values between the 5th and 6th '/', if that makes sense. Maybe something similar to a left or right function in excel. – beeeZeee Jul 19 '19 at 16:52
  • You may find this SO answer useful: https://stackoverflow.com/a/44586265/8339821 – user14063792468 Jul 19 '19 at 17:12
  • The `regexp` pattern is https://regex101.com/r/MWR4kt/1. Then you do `select regexp_replace(column, pattern, '\1')`. – user14063792468 Jul 19 '19 at 17:16
  • Is there a way to create a new column as opposed to replacing the original column? – beeeZeee Jul 19 '19 at 17:20
  • Do you mean `insert` a new column from result set ? Just do a proper `insert` statement. `select` ***is*** creating new columns in some sense, as a result is a table. – user14063792468 Jul 19 '19 at 17:28
  • I am wondering if there is a different solution. The one you provided requires knowledge of compiling and installing sqlite extensions. – beeeZeee Jul 19 '19 at 17:37
  • I'm not willing to guess your data layout. Please read [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) – user14063792468 Jul 19 '19 at 17:44
  • I provided you the url link layout above.. Is there some confusion about it? Let me know and I can further explain. We can say that the content are blog posts. For example: https://www.blogposts.com/en_us/blogpost/236353/how_to_write_a_blog_post **The example is removing the https:// and www. for some reason, but it is in the example link. – beeeZeee Jul 19 '19 at 17:59

2 Answers2

1

String Manipulation? I used the string 'Type of content' with PatIndex() to find where that string exists and then CharIndex() to find the starting and ending slashes for the content id. Then Substring() to pull the string out.

    Create Table #links
    (
    link VarChar(200)
    )
    Insert Into #links Values
    ('''https://''website name''/''language''/''type of content''/''en_au''/''name of content'''),
    ('''https://''website name''/''language''/''type of content''/''fr_ca''/''name of content'''),
    ('''https://''website name''/''language''/''type of content''/''en_us''/''name of content''')

Query

    Select SubString
          (
           link,
           PatIndex('%type of content%',link)+17,
           CharIndex('/',link,PatIndex('%type of content%',link)+17)-CharIndex('/',link,PatIndex('%type of content%',link))-1
          )
           As YourNewColumn
    From #links

Result

YourNewColumn
'en_au'
'fr_ca'
'en_us'
level3looper
  • 1,015
  • 1
  • 7
  • 10
0

Figured it out. This is the answer for Sqlite:

SELECT
 substr(links, instr(links, 'type of content') + 11, 6) AS content_id
FROM df

if you have one type of content, like in my case i only have blog_posts, then you can use 10. The number 10 allows you to skip ahead of "blog_posts/" in the url link to the content_id. Since all my content ids are 6 characters long, I put in 6 to capture only 6 values.

beeeZeee
  • 51
  • 1
  • 2
  • 10