I have a string like this

Bodies of 5 /Irish/ immigrants /'murdered and killed by cholera' while building a railroad/ in 1832 to http://www.bbc.com/news/

I tried to get rid of the slashes using the following:

replaceAll("/","");

What I got was

Bodies of 5 Irish immigrants 'murdered and killed by cholera' while building a railroad in 1832 to http:www.bbc.comnews

I want to preserve the slashes in the URL but get rid of the other slashes in the text. Any suggestions would be much appreciated.

Mohan Timilsina
  • Split the string using String.split, then loop through the words; if a word is not a URL, remove "/" from it. You can check whether a URL is valid with a URL validator, e.g. as in [How to check for a valid URL in Java?](http://stackoverflow.com/questions/2230676/how-to-check-for-a-valid-url-in-java) – nafas Jul 07 '14 at 14:21
  • This might help you: `\s/|/\s`, which means any `/` that has a space before or after it. Or make it more precise: `(?<=\s)(/)|(/)(?=\s)`. Here is a [DEMO](http://regex101.com/r/fE8mU2/1), assuming the URL is always at the end. Pass the regex to the `replaceAll()` method. – Braj Jul 07 '14 at 14:22
  • There is no need to group `/` in case of replace so try `(?<=\s)/|/(?=\s)` – Braj Jul 07 '14 at 14:29
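
The split-and-loop idea from the first comment could be sketched roughly as follows. This is only an illustration: the class and method names are made up, and the URL check is a deliberately naive assumption (a token is treated as a URL if it starts with http:// or https://) rather than a real validator.

```java
import java.util.StringJoiner;

class SplitSlashStripper {
    // Assumption: a token is a URL if it starts with http:// or https://.
    // A stricter validator could be swapped in here.
    static boolean looksLikeUrl(String token) {
        return token.startsWith("http://") || token.startsWith("https://");
    }

    // Splits on single spaces, strips "/" from every non-URL token,
    // and joins the tokens back together with single spaces.
    static String stripSlashes(String input) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : input.split(" ")) {
            out.add(looksLikeUrl(token) ? token : token.replace("/", ""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "Bodies of 5 /Irish/ immigrants /'murdered and killed by cholera' "
                + "while building a railroad/ in 1832 to http://www.bbc.com/news/";
        System.out.println(stripSlashes(s));
    }
}
```

Note that this removes every slash inside a non-URL word (so "and/or" becomes "andor"), and that trailing spaces in the input are dropped by split.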

2 Answers

That is a morbid example. Remember that regexes just pick up on patterns, so the best one for you will depend on your data.

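For example, in the string you provided, the regex [^:/m]/ would find the slashes you want to remove. Used directly in replaceAll it would also consume the character before each slash, so the lookbehind form (?<![:/m])/ is the safer way to write it. Either way, it also ignores any slash after an "m" anywhere in the text, and it still removes the final slash of the URL. That is not great, unless you know for a fact that none of your slashes will come after an "m".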

For this example, I would suggest separating the URL. If you know the URL will always be at the end, you can split the string and only run the replacement on the text, not the URL.

Something like this might work well for you.

String s is our morbid headline

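// replaceAll takes a regex; replace would treat "http.*" as a literal string
String text = s.replaceAll("http.*", "");
String url = s.replaceAll(".*http", "http");
text = text.replace("/", "");
// text keeps the space that preceded the URL, so no extra separator is needed
text = text + url;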

This saves everything except the URL to text and only the URL to url, then removes the slashes from text and appends the URL back to the end.
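
If the URL is not guaranteed to be at the end, the same idea (clean the text, leave the URL alone) might be sketched with a Matcher that walks over the string. The class and method names are hypothetical, and the URL pattern is an assumption: "http://" or "https://" followed by any run of non-whitespace characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlAwareSlashRemover {
    // Assumption: a URL is http:// or https:// followed by non-whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static String removeSlashesOutsideUrls(String s) {
        StringBuilder out = new StringBuilder();
        Matcher m = URL.matcher(s);
        int last = 0;
        while (m.find()) {
            // Clean the plain text between the previous URL and this one.
            out.append(s.substring(last, m.start()).replace("/", ""));
            // Copy the URL itself unchanged.
            out.append(m.group());
            last = m.end();
        }
        // Clean whatever text follows the last URL.
        out.append(s.substring(last).replace("/", ""));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeSlashesOutsideUrls(
                "See /this/ http://www.bbc.com/news/ and /that/"));
        // prints: See this http://www.bbc.com/news/ and that
    }
}
```

Because the pattern stops only at whitespace, punctuation glued to the end of a URL would be treated as part of it.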

Adam Yost
  • That is not true. As I mentioned in the text, this works whenever the url is at the end. However, with a little extra code, it can be adapted to restore the url to any location. – Adam Yost Jul 07 '14 at 14:28
  • that's what I said, I said it works for this example, but not generic for all – nafas Jul 07 '14 at 14:29
  • Once again, I started with that. Any solution's effectiveness is entirely dependent on the data format. With what we know about the data, this works. If we can see more about the format I will update. – Adam Yost Jul 07 '14 at 14:36

It seems that you want to remove only slashes which are at start or end of words. So such slashes need to

  • have space before
  • have space after
  • be placed at start of the string
  • be placed at end of the string

This approach has one potential flaw: it also removes the last slash of a URL, so http://www.some.address/ would become http://www.some.address.

If this is what you are looking for, you can try look-around mechanisms:

replaceAll("(?<=\\s|^)/|/(?=\\s|$)", "")

which will change

Bodies of 5 /Irish/ immigrants /'murdered and killed by cholera' 
while building a railroad/ in 1832 to http://www.bbc.com/news/

into

Bodies of 5 Irish immigrants 'murdered and killed by cholera' 
while building a railroad in 1832 to http://www.bbc.com/news
                                                            ^as you see it also 
                                                             removed last slash 
                                                             in this url
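
As a minimal runnable sketch of the look-around version (the class and method names are only for illustration), the following shows both the intended effect and the trailing-slash flaw; slashes inside a word, such as in "a/b", are left alone:

```java
class EdgeSlashDemo {
    // Removes a "/" only when it sits at the start or end of a word:
    // preceded by whitespace or start of input, or followed by
    // whitespace or end of input.
    static String stripEdgeSlashes(String input) {
        return input.replaceAll("(?<=\\s|^)/|/(?=\\s|$)", "");
    }

    public static void main(String[] args) {
        String s = "Bodies of 5 /Irish/ immigrants /'murdered and killed by cholera' "
                + "while building a railroad/ in 1832 to http://www.bbc.com/news/";
        // The URL's final slash is followed by end of input, so it is removed too.
        System.out.println(stripEdgeSlashes(s));
    }
}
```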

A way around the problem of removing the last / in a URL is to make the regex match the URL first and replace it with itself. This prevents the slashes in that URL from being tested again for having a space or start-of-string before them, or a space or end-of-string after them.
I mean a regex of the form

(matchesURL)|matchesSlashesAtStartOfWord|matchesSlashesAtEndOfWord

With such a regex, a / matched by (matchesURL) cannot be matched again by matchesSlashesAtStartOfWord|matchesSlashesAtEndOfWord.

So you can use something like

replaceAll("(https?://[^/]+(/[^/]+)*/?)|(?<=\\s|^)/|/(?=\\s|$)", "$1")

which will first match URLs, put them into group 1, and replace them with the content of group 1 ($1). Since the other alternatives of the regex, (?<=\\s|^)/|/(?=\\s|$), can't place anything in group 1, $1 will be empty for them, so each such / is replaced with nothing (that is, removed).

DEMO

String data = "Bodies of 5 /Irish/ immigrants /'murdered and killed by cholera' \r\nwhile building a railroad/ in 1832 to http://www.bbc.com/news/";
System.out.println(data);
System.out.println();
System.out.println(data.replaceAll("(https?://[^/]+(/[^/]+)*/?)|(?<=\\s|^)/|/(?=\\s|$)", "$1"));

Output

Bodies of 5 /Irish/ immigrants /'murdered and killed by cholera' 
while building a railroad/ in 1832 to http://www.bbc.com/news/

Bodies of 5 Irish immigrants 'murdered and killed by cholera' 
while building a railroad in 1832 to http://www.bbc.com/news/
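
One caveat worth illustrating: in the URL part of the regex above, [^/]+ also matches spaces, so when the URL is followed by more text the match can run on past the URL and protect later slashes as well. A possible variant, shown here only as a sketch with made-up class and method names, assumes a URL ends at the first whitespace character and uses \S+ instead:

```java
class UrlFirstSlashDemo {
    // The regex from the answer: URL first, then slashes at word edges.
    static String original(String input) {
        return input.replaceAll(
                "(https?://[^/]+(/[^/]+)*/?)|(?<=\\s|^)/|/(?=\\s|$)", "$1");
    }

    // Variant (assumption: a URL never contains whitespace), so the URL
    // alternative stops at the first space after "http://" or "https://".
    static String whitespaceBounded(String input) {
        return input.replaceAll(
                "(https?://\\S+)|(?<=\\s|^)/|/(?=\\s|$)", "$1");
    }

    public static void main(String[] args) {
        String s = "See /this/ http://www.bbc.com/news/ and /that/";
        System.out.println(original(s));
        // prints: See this http://www.bbc.com/news/ and /that/
        System.out.println(whitespaceBounded(s));
        // prints: See this http://www.bbc.com/news/ and that
    }
}
```

For input where the URL comes last, as in the headline from the question, both versions give the same result.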
Pshemo