0

I have a large list of websites and I want to put them all in a single String variable. I know I could go through the links one by one and escape the //, but there are a few hundred of them. Is there a way to do a "block escape", so that everything inside the "block" is escaped? This is an example of what I want to store in the variable:

String links="http://website http://website http://website http://website http://website http://website"

Also can anyone think of any other problems I might run into while doing this?

I made it htp instead of http because I am not allowed to post "hyperlinks" according to stack overflow as I am not at that level :p

Thanks so much

Edit: I am writing this program because I have about 50 pages of a Word document filled with both emails and other text, and I want to filter out just the emails. The program to do that was very simple to write; now I just need to figure out a way to store the pages in a String variable for the program to run on.

Code-Apprentice
vegetablelasagna
  • **Why** in a single string, and not in a fixed-size `String[]` with one link per index, or a dynamic `java.util.List`? – jlordo Dec 13 '12 at 00:20
  • You don't need to escape *forward* slashes, only backslashes need escaping. – Sergey Kalinichenko Dec 13 '12 at 00:20
  • And what do you mean by "escape" here? Do you mean prefixing with the protocol, i.e. adding `"http://"` to `"website website website"`? (As @dasblinkenlight says, if you already have `"http://website"`, it does not need any escaping in the sense of inserting escape characters like `\`.) – Amadan Dec 13 '12 at 00:20
  • @vege Show some expected output and some code that you have tried. – Smit Dec 13 '12 at 00:22
  • unfortunately I can not post my "links" but one minute – vegetablelasagna Dec 13 '12 at 00:23
  • OK, so here is my problem. I have a Word document with a bunch of text, and when I copy it into my string, it doesn't all get stored in the variable once there is a line break. Wherever a line is skipped in my Word paragraph, it stops putting text in the variable. Do you guys understand what I am talking about? – vegetablelasagna Dec 13 '12 at 00:26
  • so I have: "a line" a line space "a line" – vegetablelasagna Dec 13 '12 at 00:29
  • So I can't make a string that extends over multiple lines? – vegetablelasagna Dec 13 '12 at 00:32
  • I am making this program because I have about 50 pages of a Word document that is filled with both emails and other text. I want to filter out just the emails. The program to do this was very simple to write; now I just need to figure out a way to store the pages in a string variable. – vegetablelasagna Dec 13 '12 at 00:40
  • @vegetablelasagna Why not just read the string/data in from a file? The file could even be packaged into the same JAR. This would make the data easy to change and avoid needing to escape (or not escape) the string literals, and it would likely result in a cleaner design that is easier to deal with in general. Java string literals are quite boringly simple and lack the verbatim and here-doc syntax found in other languages. –  Dec 13 '12 at 00:42
  • @vegetablelasagna you have a problem, you're using word, now you have more problems. – dlamblin Dec 13 '12 at 03:11

4 Answers

2

Your question is not well-written. Improve it, please. In its current format it will be closed as "too vague".

Do you want to filter e-mails or websites? Your example is about websites, but your text is about e-mails. Since I don't know, and I decided to try to help you anyway, I did both.

Here goes the code:

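import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// (?:...) is a non-capturing group; it only groups, it does not save a match.
private static final Pattern EMAIL_REGEX =
        Pattern.compile("[A-Za-z0-9](?:(?:[_\\.\\-]?[a-zA-Z0-9]+)*)@(?:[A-Za-z0-9]+)(?:(?:[\\.\\-]?[a-zA-Z0-9]+)*)\\.(?:[A-Za-z]{2,})");

// Matches http:// or https:// followed by a run of common URL characters.
private static final Pattern WEBSITE_REGEX =
        Pattern.compile("http(?:s?)://[_#\\.\\-/\\?&=a-zA-Z0-9]*");

// Reads the whole file as UTF-8 text. Files.readAllBytes (Java 7+) reads every byte,
// unlike a single InputStream.read call, which may return before the buffer is full.
public static String readFileAsString(String fileName) throws IOException {
    return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
}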

public static List<String> filterEmails(String everything) {
    List<String> list = new ArrayList<String>(8192);
    Matcher m = EMAIL_REGEX.matcher(everything);
    while (m.find()) {
        list.add(m.group());
    }
    return list;
}

public static List<String> filterWebsites(String everything) {
    List<String> list = new ArrayList<String>(8192);
    Matcher m = WEBSITE_REGEX.matcher(everything);
    while (m.find()) {
        list.add(m.group());
    }
    return list;
}

To make sure that it works, let's first test the filterEmails and filterWebsites methods:

public static void main(String[] args) {
    System.out.println(filterEmails("Orange, pizza whatever else joe@somewhere.com a lot of text here. Blahblah blah with Luke Skywalker (luke@starwars.com) hfkjdsh fhdsjf jdhf Paulo <aaa.aaa@bgf-ret.com.br>"));
    System.out.println(filterWebsites("Orange, pizza whatever else joe@somewhere.com a lot of text here. Blahblah blah with Luke Skywalker (http://luke.starwars.com/force) hfkjdsh fhdsjf jdhf Paulo <https://darth.vader/blackside?sith=true&midclorians> And the http://www.somewhere.com as x."));
}

It outputs:

[joe@somewhere.com, luke@starwars.com, aaa.aaa@bgf-ret.com.br]
[http://luke.starwars.com/force, https://darth.vader/blackside?sith=true&midclorians, http://www.somewhere.com]

To test the readFileAsString method:

// readFileAsString throws the checked IOException, so main must declare it.
public static void main(String[] args) throws IOException {
    System.out.println(readFileAsString("C:\\The_Path_To_Your_File\\SomeFile.txt"));
}

If that file exists, its content will be printed.

If you don't like the fact that it returns List<String> instead of a String with items divided by spaces, this is simple to solve:

public static String collapse(List<String> list) {
    StringBuilder sb = new StringBuilder(50 * list.size());
    for (String s : list) {
        sb.append(" ").append(s);
    }
    sb.delete(0, 1);
    return sb.toString();
}
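As an aside, on Java 8 or later the same result is available from the standard library. This is only a sketch of an alternative to the hand-written loop above; the class name JoinSketch and the example.com addresses are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
final class JoinSketch {

    // Joins the items with single spaces; an empty list yields an empty string.
    static String collapse(List<String> items) {
        return String.join(" ", items);
    }

    public static void main(String[] args) {
        // Placeholder addresses, not taken from the question.
        List<String> sites = Arrays.asList("http://a.example.com", "http://b.example.com");
        System.out.println(collapse(sites));
    }
}
```

String.join needs no trailing cleanup step, because it only puts the separator between items.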

Putting it all together:

String fileName = ...;
String webSites = collapse(filterWebsites(readFileAsString(fileName)));
String emails = collapse(filterEmails(readFileAsString(fileName)));
0

I suggest that you save your Word document as plain text. Then you can read the text with a class such as BufferedReader from the java.io package, or with java.util.Scanner.

To solve the issue of overwriting the String variable each time you read a line, you can use an array or an ArrayList. This is much better than holding all the web addresses in a single String, because you can easily access each address individually whenever you like.
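A minimal sketch of that idea, assuming the document has already been saved as a UTF-8 plain-text file whose path is passed as the first program argument. The class name LineReaderSketch and the method name readLines are made up for this example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration.
final class LineReaderSketch {

    // Collects every line from the scanner into a list, one entry per line.
    // Blank lines are kept as empty strings, so nothing is lost at a line break.
    static List<String> readLines(Scanner in) {
        List<String> lines = new ArrayList<String>();
        while (in.hasNextLine()) {
            lines.add(in.nextLine());
        }
        return lines;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length == 0) {
            System.out.println("usage: java LineReaderSketch <text-file>");
            return;
        }
        // Assumption: the file is encoded as UTF-8.
        Scanner in = new Scanner(new File(args[0]), "UTF-8");
        try {
            List<String> lines = readLines(in);
            System.out.println(lines.size() + " lines read");
        } finally {
            in.close();
        }
    }
}
```

Each line ends up in its own list entry, so a line break in the source text no longer cuts the data off.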

Code-Apprentice
0

For your first problem: take all the text out of Word and put it into a tool that supports regular expressions. Use a regular expression to wrap each line in quotes and end it with a +. Then edit the last line and change the + to a ;. Above the first line, write String links =. Copy the result into your Java source. Here's an example using regexr.
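The same quoting step can also be sketched in Java itself, in case no regex tool is at hand. This is an illustration rather than the regexr approach: the class name LiteralSketch and its methods are made up, and it additionally escapes backslashes and double quotes, which the lines would need before they compile as string literals.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
final class LiteralSketch {

    // Escapes backslashes first, then double quotes, so the text can sit
    // inside a Java string literal.
    static String escape(String line) {
        return line.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Turns each input line into one quoted source line joined with +,
    // and ends the last one with a semicolon.
    static String toJavaSource(List<String> lines) {
        if (lines.isEmpty()) {
            return "String links = \"\";\n";
        }
        StringBuilder sb = new StringBuilder("String links =\n");
        for (int i = 0; i < lines.size(); i++) {
            boolean last = (i == lines.size() - 1);
            sb.append("    \"").append(escape(lines.get(i)));
            // A trailing space keeps neighbouring entries apart in the final string.
            sb.append(last ? "\";" : " \" +").append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder lines, not taken from the question.
        System.out.print(toJavaSource(Arrays.asList("http://a.example.com", "http://b.example.com")));
    }
}
```

Concatenating literals with + does not get around the size limit on string constants, since the compiler folds them into a single constant.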

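To answer your second question (thinking of problems): there is an upper limit on the length of a Java string literal, if I recall correctly 2^16 - 1 (65,535) bytes in its encoded form, because the class file format stores the length in a 16-bit field. Fifty pages of text could exceed that.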

Oh and Perl was basically written for you to do this kind of thing (take 50 pages of text and separate out what is a url and what is an email)... not to mention grep.

dlamblin
-1

I'm not sure what kind of "list of websites" you're referring to, but for a comma-separated file of websites, for example, you could read the entire file and use the String split function to get an array, or you could use a BufferedReader to read the file line by line and add each line to an ArrayList.
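Both options might look roughly like this. It is a sketch under assumptions: the class name SiteListSketch, the method names, and the example.com addresses are made up, and the input is read from a string here so the example runs without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
final class SiteListSketch {

    // Option 1: split one comma-separated string, ignoring spaces around the commas.
    static List<String> splitSites(String csv) {
        return Arrays.asList(csv.trim().split("\\s*,\\s*"));
    }

    // Option 2: read line by line and collect each line into a list.
    static List<String> readSites(BufferedReader reader) throws IOException {
        List<String> sites = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            sites.add(line);
        }
        return sites;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(splitSites("http://a.example.com, http://b.example.com"));
        // For a real file, wrap a java.io.FileReader in the BufferedReader instead.
        BufferedReader reader =
                new BufferedReader(new StringReader("http://a.example.com\nhttp://b.example.com"));
        System.out.println(readSites(reader));
    }
}
```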

From there you can simply loop the array and append to a String, or if you need to:

do a "block escape", so everything in between the "block" is escaped

You can use a Regular Expression to extract parts of each String according to a pattern:

String oldString = "<someTag>I only want this part</someTag>";
String regExp = "(?i)(<someTag.*?>)(.+?)(</someTag>)";
String newString = oldString.replaceAll(regExp, "$2");

The expression above removes the XML tags because the replacement is "$2", which keeps only the second group of the expression; groups are delimited by round brackets ( ). Using "$1$3" instead should give you only the surrounding XML tags.
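A small runnable sketch of both replacements, reusing the same pattern. The class name GroupSketch and its method names are made up for this example.

```java
// Hypothetical class name, used only for this illustration.
final class GroupSketch {

    // Group 1 is the opening tag, group 2 the content, group 3 the closing tag.
    static final String TAG_PATTERN = "(?i)(<someTag.*?>)(.+?)(</someTag>)";

    // Keeps only group 2, i.e. the text between the tags.
    static String inner(String s) {
        return s.replaceAll(TAG_PATTERN, "$2");
    }

    // Keeps groups 1 and 3, i.e. only the surrounding tags.
    static String tagsOnly(String s) {
        return s.replaceAll(TAG_PATTERN, "$1$3");
    }

    public static void main(String[] args) {
        String oldString = "<someTag>I only want this part</someTag>";
        System.out.println(inner(oldString));    // prints: I only want this part
        System.out.println(tagsOnly(oldString)); // prints: <someTag></someTag>
    }
}
```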

Another much simpler approach to removing certain "blocks" from a String is the String replace function, where to remove the block you could simply pass in an empty string as the new value.

I hope some of this helps; otherwise, you could try to provide a full example with your input "list of websites" and the output you want.

JGaarsdal