307

I want to replace whitespace with underscore in a string to create nice URLs. So that for example:

"This should be connected" 

Should become

"This_should_be_connected" 

I am using Python with Django. Can this be solved using regular expressions?

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
Lucas
  • 13,679
  • 13
  • 62
  • 94
  • 1
    How can this this be achieved in django template. Is there any way to **remove** white spaces. Is there any built in tag/filter to do this? Note: `slugify` doesn't gives the desired output. – user1144616 Mar 12 '12 at 17:52

14 Answers14

520

You don't need regular expressions. Python has a built-in string method that does what you need:

mystring.replace(" ", "_")
Luke Exton
  • 3,506
  • 2
  • 19
  • 33
rogeriopvl
  • 51,659
  • 8
  • 55
  • 58
  • 54
    This doesn't work with other whitespace characters, such as \t or a non-breaking space. – Roberto Bonvallet Jun 17 '09 at 15:49
  • 15
    Yes you are correct, but for the purpose of the question asked, it doesn't seem necessary to take those other spaces into account. – rogeriopvl Jun 17 '09 at 17:39
  • 2
    do I need to import anything for this to work? I get the following error: AttributeError: 'builtin_function_or_method' object has no attribute 'replace' – Ocasta Eshu Oct 31 '12 at 02:23
  • 2
    Probably the variable that you called replace on, was not a string type. – Snigdha Batra Aug 10 '15 at 09:10
  • 17
    This answer could be confusing, better write it as mystring = mystring.replace(" ", "_") as it doesn't directly alter the string but rather returns a changed version. – Mehdi Nov 02 '18 at 09:37
  • 1
    Thanks, This is working. Would be better if written like: `mystring = mystring.replace(" ", "_")` – Chandan Sharma Oct 09 '19 at 11:15
  • 5
    does not work with non-breaking spaces, use `re.sub(r"\s+", '', content)` instead – Macbric Nov 09 '19 at 16:44
  • **Whitespace >>>> literal space character.** The question title explicitly states "whitespace"; the question content then reiterates that. As @Macbric suggests, `re.sub(r'\s+', '_', content)` is the canonical solution. *And yet again the best answer is a comment with two upvotes.* `` – Cecil Curry Mar 31 '22 at 05:24
109

Replacing spaces is fine, but I might suggest going a little further to handle other URL-hostile characters like question marks, apostrophes, exclamation points, etc.

Also note that the general consensus among SEO experts is that dashes are preferred to underscores in URLs.

import re

def urlify(s):

    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", '', s)

    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)

    return s

# Prints: I-cant-get-no-satisfaction"
print(urlify("I can't get no satisfaction!"))
vvvvv
  • 25,404
  • 19
  • 49
  • 81
Kenan Banks
  • 207,056
  • 34
  • 155
  • 173
  • This is interesting. I will definitely use this advice. – Lucas Jun 17 '09 at 15:08
  • Remember to urllib.quote() the output of your urlify() - what if s contains something non-ascii? – zgoda Jun 19 '09 at 07:17
  • 1
    This is nice - but the first RE with \W will *also remove whitespace* with the result that the subsequent RE has nothing to replace... If you want to replace your other characters with '-' between tokens have the first RE replace with a single space as indicated - i.e. s = re.sub(r"\W", '&nbsp', s) (this may be a shonky formatting issue on StackOverflow: http://meta.stackexchange.com/questions/105507/how-to-add-a-space-in-the-code-section) – timlukins Jun 12 '12 at 15:45
  • @TimTheEnchanter - good catch. Fixed. What is the airspeed velocity of an unladen swallow? – Kenan Banks Jun 12 '12 at 17:00
  • 2
    @Triptych What do you mean? African or European swallow? – timlukins Jun 13 '12 at 10:57
  • 1
    Another slight problem with this is you remove any preexisting hyphens in the url, so that if the user had attempted to clean the url string before uploading to be this-is-clean, it would be stripped to thisisclean. So s = re.sub(r'[^\w\s-]', '', s). Can go one step further and remove leading and trailing whitespace so that the filename doesn't end or start with a hyphen with s = re.sub(r'[^\w\s-]', '', s).strip() – Intenex Jul 17 '12 at 05:12
  • `re.sub('[%s\s]+' % '-', '-', s)` this works for utf-8 chars – Iliyass Hamza May 29 '19 at 09:58
54

This takes into account blank characters other than space and I think it's faster than using re module:

url = "_".join( title.split() )
xOneca
  • 842
  • 9
  • 20
  • 4
    More importantly it will work for any whitespace character or group of whitespace characters. – dshepherd May 08 '13 at 15:17
  • This solution does not handle all whitespace characters. (e.g. [`\x8f`](http://www.charbase.com/008f-unicode-single-shift-three)) – Lokal_Profil Dec 05 '16 at 14:43
  • Good catch, @Lokal_Profil! The [documentation](https://docs.python.org/2/library/stdtypes.html#str.split) doesn't specify which whitespace chars are taken into account. – xOneca Dec 06 '16 at 07:39
  • 1
    This solution will also not preserve repeat delimiters, as split() does not return empty items when using the default "split on whitespace" behavior. That is, if the input is "hello,(6 spaces here)world", this will result in "hello,_world" as output, rather than "hello,______world". – FliesLikeABrick Dec 04 '18 at 14:29
  • 2
    regexp > split/join > replace – Utku Cansever Mar 17 '22 at 12:23
  • This is especially helpful, if you want to replace any number of whitespace characters with 1 character. Like in "reducing all whitespace to 1 space" szenarios. Very handy to remove line breaks, tabs and multi-` ` etc. from a string. – CodingCat May 03 '23 at 13:28
45

Django has a 'slugify' function which does this, as well as other URL-friendly optimisations. It's hidden away in the defaultfilters module.

>>> from django.template.defaultfilters import slugify
>>> slugify("This should be connected")

this-should-be-connected

This isn't exactly the output you asked for, but IMO it's better for use in URLs.

Daniel Roseman
  • 588,541
  • 66
  • 880
  • 895
  • That is an interesting option, but is this a matter of taste or what are the benefits of using hyphens instead of underscores. I just noticed that Stackoverflow uses hyphens like you suggest. But digg.com for example uses underscores. – Lucas Jun 17 '09 at 15:23
  • This happens to be the preferred option (AFAIK). Take your string, slugify it, store it in a SlugField, and make use of it in your model's get_absolute_url(). You can find examples on the net easily. – shanyu Jun 17 '09 at 16:13
  • 3
    @Lulu people use dashes because, for a long time, search engines treated dashes as word separators and so you'd get an easier time coming up in multi-word searches. – James Bennett Jun 19 '09 at 20:08
  • @Daniel Roseman can I use this with dynamically variable. as I am getting dynamic websites as string in a veriable – ephemeral Aug 22 '17 at 07:30
  • This is the right answer. You need to sanitize your URLs. – kagronick May 06 '18 at 16:40
  • This doesn't work with utf-8 characters, I tested it with Arabic and it returned "" empty string – Iliyass Hamza May 29 '19 at 09:51
29

Using the re module:

import re
re.sub('\s+', '_', "This should be connected") # This_should_be_connected
re.sub('\s+', '_', 'And     so\tshould this')  # And_so_should_this

Unless you have multiple spaces or other whitespace possibilities as above, you may just wish to use string.replace as others have suggested.

Jarret Hardie
  • 95,172
  • 10
  • 132
  • 126
  • Thank you, this was exactly what I was asking for. But I agree, the "string.replace" seems more suitable for my task. – Lucas Jun 17 '09 at 14:59
  • 1
    PEP8 replace this `'\s+'` with this `r'\s+'`. Info: https://www.flake8rules.com/rules/W605.html – mrroot5 Oct 18 '21 at 16:58
11

use string's replace method:

"this should be connected".replace(" ", "_")

"this_should_be_disconnected".replace("_", " ")

mdirolf
  • 7,521
  • 2
  • 23
  • 15
8

You can try this instead:

mystring.replace(r' ','-')
Jules Dupont
  • 7,259
  • 7
  • 39
  • 39
Meghaa Yadav
  • 81
  • 1
  • 1
7

Python has a built in method on strings called replace which is used as so:

string.replace(old, new)

So you would use:

string.replace(" ", "_")

I had this problem a while ago and I wrote code to replace characters in a string. I have to start remembering to check the python documentation because they've got built in functions for everything.

7

Surprisingly this library not mentioned yet

python package named python-slugify, which does a pretty good job of slugifying:

pip install python-slugify

Works like this:

from slugify import slugify

txt = "This is a test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")

txt = "This -- is a ## test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")

txt = 'C\'est déjà l\'été.'
r = slugify(txt)
self.assertEquals(r, "cest-deja-lete")

txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
self.assertEquals(r, "nin-hao-wo-shi-zhong-guo-ren")

txt = 'Компьютер'
r = slugify(txt)
self.assertEquals(r, "kompiuter")

txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
self.assertEquals(r, "jaja-lol-mememeoo-a") 
Yash
  • 6,644
  • 4
  • 36
  • 26
5

I'm using the following piece of code for my friendly urls:

from unicodedata import normalize
from re import sub

def slugify(title):
    name = normalize('NFKD', title).encode('ascii', 'ignore').replace(' ', '-').lower()
    #remove `other` characters
    name = sub('[^a-zA-Z0-9_-]', '', name)
    #nomalize dashes
    name = sub('-+', '-', name)

    return name

It works fine with unicode characters as well.

Armandas
  • 2,276
  • 1
  • 22
  • 27
4
mystring.replace (" ", "_")

if you assign this value to any variable, it will work

s = mystring.replace (" ", "_")

by default mystring wont have this

timbre timbre
  • 12,648
  • 10
  • 46
  • 77
Rajesh
  • 41
  • 2
2

OP is using python, but in javascript (something to be careful of since the syntaxes are similar.

// only replaces the first instance of ' ' with '_'
"one two three".replace(' ', '_'); 
=> "one_two three"

// replaces all instances of ' ' with '_'
"one two three".replace(/\s/g, '_');
=> "one_two_three"
skilleo
  • 2,451
  • 1
  • 27
  • 34
0
x = re.sub("\s", "_", txt)
eyllanesc
  • 235,170
  • 19
  • 170
  • 241
  • 1
    There are already 2 answers suggesting the match pattern `\s+` for the replacement. Can you explain what are the differences/benefits of using your pattern (without the +)? – Tomerikoo Nov 28 '21 at 08:33
  • When you use `\s`,it will matches only for a single whitespace character.But when you use `\s+`,it will match one or more occurrence of white spaces. Example: `text = "hi all"` `re.sub("\s", "_", text)` `re.sub("\s+", "_", text)` Output for \s : hi__all Output for \s+ : hi_all With this example `\s` will mach only for one space.So when two white spaces is given consecutive on input,it will replace `_` for each of them.On other hand `\s+` will match for one space.So it will replace one `_` for each consecutive white spaces. – Allen Jose Nov 28 '21 at 12:29
  • 2
    @AllenJose, all of that belongs in your _answer_, not in a comment. Comments can be deleted at any time for any reason. Please read [answer]. – ChrisGPT was on strike Nov 28 '21 at 13:25
  • @AllenJose My comment was more to encourage you to [edit] your answer with that information. I personally know that difference but actually am not sure what is the benefit of your way. It seems to make more sense to replace multiple consecutive spaces with a ***single*** underscore. Again, this should be explained in your answer to help other readers make the distinction according to their needs – Tomerikoo Nov 28 '21 at 14:49
-4
perl -e 'map { $on=$_; s/ /_/; rename($on, $_) or warn $!; } <*>;'

Match et replace space > underscore of all files in current directory