Read HTML file, leave certain parts as-is and translate others

Question

I'm trying to build a program that reads an HTML file and translates certain content to pig latin (keeping the same case, all line breaks and all apostrophes). I want it to ignore anything inside HTML tags, numbers, punctuation and URLs.

I think I'm getting close; I'm just looking for hints on which library methods I should use and where I should do the translation.

I realize the replaceAll method is wrong. I hope there's something like replaceAll but "ignoreALL" that can ignore things I don't need translated.

Right now it takes a test.html with:

<sdhfusidgfhdsfiugdfhghds9fuighdsfigudsf>3423423 JONES!

and returns:

ONES! 3423423 Jay

I'd like it to return <sdhfusidgfhdsfiugdfhghds9fuighdsfigudsf>3423423 ONES!JAY

Here's what I have so far:

import java.io.*;
import java.util.Scanner;
import java.util.Formatter;

public class test {

private test() {}

public static void main (String[] args) throws Exception{

 StringBuilder sb = new StringBuilder();

 BufferedReader br = new BufferedReader(new FileReader("test.html"));

 String line;

 while ( (line=br.readLine()) != null) {

     sb.append(line).append(System.getProperty("line.separator"));
 }

 String nohtml = sb.toString().replaceAll("\\<.*?>", "");


    final String vowels = "aeiouAEIOUy";


        String beforVowel = "";
        int cut = 0;
        while (cut < nohtml.length() && !vowels.contains("" + nohtml.charAt(cut)))
        {
            beforVowel += nohtml.charAt(cut);
            cut++;
        }
        if (cut == 0)
        {
            cut = 1;
            nohtml += nohtml.charAt(0) + "w";
        }
        System.out.println(nohtml.substring(cut) + beforVowel + "ay");


}

}

Thanks for any guidance.

@immibis i'd like to do it without downloading any outside parsers — GeorgeCostanza, Feb 15 '15 at 09:32
You'll find it infinitely easier if you lift that constraint and do this job with XPath or XSLT. — user207421, Feb 15 '15 at 22:25

score -1 · Accepted Answer · edited May 23 '17 at 12:06


You can split the content of your HTML file into tags and non-tags using regex look-ahead (?=subexpr) and look-behind (?<=subexpr), which effectively describe zero-length delimiters. A second regex, <.*>, then distinguishes between the two groups.
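To see what the zero-length delimiters do in isolation, here is a minimal sketch. The class name SplitDemo, the helper splitTags and the sample markup are made up for illustration, and it assumes Java 8 or later, where split does not produce an empty leading element for a zero-width match at the start of the input.

```java
public class SplitDemo {
    // Hypothetical helper: splits after every '>' and before every '<'
    // without consuming any characters.
    static String[] splitTags(String html) {
        return html.split("(?<=>)|(?=<)");
    }

    public static void main(String[] args) {
        String[] parts = splitTags("<p>Hello <b>big</b> world</p>");
        for (String part : parts) {
            // A part that starts with '<' and ends with '>' is treated as a tag.
            boolean isTag = part.matches("<.*>");
            System.out.println((isTag ? "tag : " : "text: ") + part);
        }
    }
}
```

For this input the parts are <p>, "Hello ", <b>, big, </b>, " world" and </p>, so only the three text parts would be handed to the translation step.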

// needs java.io.BufferedReader and java.io.FileReader (covered by import java.io.*)
// read file into StringBuilder; try-with-resources closes the reader afterwards
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader("test.html"))) {
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line).append(System.lineSeparator());
    }
}
String html = sb.toString();

// untangle tags and non-tags
String[] parts = html.split("(?<=>)|(?=<)");
for (int i = 0; i < parts.length; i++) {
    // (?s) lets '.' match line breaks, so a tag spanning several lines still counts as a tag
    if (!parts[i].matches("(?s)<.*>")) {
        // translate words to pig latin
        parts[i] = parts[i].replaceAll(
            "\\b([AEOUIaeoui]+\\w*)\\b", "$1ay").replaceAll(
            "\\b([\\w&&[^AEOUIaeoui]]+)(\\w*?)\\b", "$2$1ay");
    }
}

// join parts back together
html = String.join("", parts);
System.out.println(html);

I don't know your exact variant of pig latin, but \\b([AEOUIaeoui]+\\w*)\\b matches every run of word characters that is delimited by word boundaries \\b, starts with at least one vowel and continues with any number of word characters. The match is replaced by the group captured between ( and ) (the whole word) followed by "ay".

Then \\b([\\w&&[^AEOUIaeoui]]+)(\\w*?)\\b matches words that start with one or more word characters other than vowels, followed by any number of word characters. The first group is greedy, so it captures all leading non-vowel characters; the lazy \\w*? in the second group then takes the rest of the word up to the closing word boundary. The match is replaced by the second group, followed by the first group, followed by "ay".
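The two replacements can be tried on plain strings without any file handling. This is only a sketch: the class name PigLatinDemo and the method translate are hypothetical wrappers around the same two regexes as above.

```java
public class PigLatinDemo {
    // Hypothetical wrapper: applies the two replacements from the answer to non-tag text.
    static String translate(String text) {
        return text
            // words starting with a vowel: append "ay"
            .replaceAll("\\b([AEOUIaeoui]+\\w*)\\b", "$1ay")
            // words starting with non-vowels: move that prefix to the end, then append "ay"
            .replaceAll("\\b([\\w&&[^AEOUIaeoui]]+)(\\w*?)\\b", "$2$1ay");
    }

    public static void main(String[] args) {
        System.out.println(translate("JONES!"));     // ONESJay!
        System.out.println(translate("apple pie"));  // appleay iepay
        // \w also covers digits and '_', so a bare number is matched as well
        System.out.println(translate("3423423"));    // 3423423ay
    }
}
```

The last call shows why numbers are not skipped yet: the digits are non-vowel word characters, so the whole number lands in the first group.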

The String.join function requires Java 8. If the code has to work with an older version, you need to concatenate the parts yourself.
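On an older Java version, a plain loop over the array does the same job. A minimal sketch, with JoinDemo and join as made-up names and a hand-written sample array:

```java
public class JoinDemo {
    // Hypothetical replacement for String.join("", parts) that avoids Java 8 APIs.
    static String join(String[] parts) {
        StringBuilder joined = new StringBuilder();
        for (String part : parts) {
            joined.append(part);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"<p>", "elloHay", "</p>"};
        System.out.println(join(parts)); // <p>elloHay</p>
    }
}
```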

Note: This approach piggyfies script sections as well, and it sometimes fails if < and > characters that are not part of a tag are not properly escaped as &lt; and &gt;. E.g. in <a href="#" title=">" class="special">a link</a> it translates class="special", too.

answered Feb 15 '15 at 21:29 by R2-D2 · edited May 23 '17 at 12:06 by Community

Could one of the downvoters drop a piece of explanation here? – R2-D2 Feb 15 '15 at 23:19
thanks for the help, but i'm not sure if this is going to work. i'm playing with it now, but i need access to the individual parts (words, fragments, urls, etc) in each line of the html file to see if they're eligible to be translated. my original post might be confusing, or maybe i'm doing something wrong – GeorgeCostanza Feb 16 '15 at 01:55
I added the actual translation to my answer -- check it out. – R2-D2 Feb 16 '15 at 11:49
alright this is awesome. i'm not really familiar with regex, but this is definitely what i was looking for. i'll play around with it, but i'm guessing I add everything i want to ignore inside the parentheses after split? so if i want to ignore numbers i'll create a regex for them and add them after a "|"? thanks! – GeorgeCostanza Feb 16 '15 at 18:14
No, this splits the string before `<` and after `>`. To ignore numbers (or more specifically only match "real" words) try to ensure every match starting with non-vowels contains at least one word character by replacing `(\\w*?)` with `(\\w+?)`. Don't hesitate to read some tutorials on regex, if you're not comfortable with them yet. – R2-D2 Feb 16 '15 at 21:56
okay i'm definitely doing that now. thanks for going out of your way to assist. looks like regex is pretty powerful – GeorgeCostanza Feb 16 '15 at 23:37
one last question, only because i've been trying so long that everything is starting to look like gibberish, haha. i can get bypass to display as ypassbay, but i'm trying to get words that begin with y to treat y as a consonant. for example: i'd like yellow to return ellowyay – GeorgeCostanza Feb 17 '15 at 06:59
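
For the last few comments, one possible variation is sketched below. It is not part of the original answer, and the class name LettersOnlyPigLatin and the method translate are made up. It assumes that only runs of ASCII letters count as words, so numbers are left alone (\w would also cover digits), and that y counts as a consonant only when it is the first letter of a word.

```java
public class LettersOnlyPigLatin {
    // Hypothetical variant: letters only, 'y' is a consonant only in first position.
    static String translate(String text) {
        return text
            // words starting with a vowel: append "ay"
            .replaceAll("\\b([AEIOUaeiou][A-Za-z]*)\\b", "$1ay")
            // group 1: first consonant (may be y), then further consonants except y
            // group 2: the rest of the word, at least one letter
            .replaceAll(
                "\\b([A-Za-z&&[^AEIOUaeiou]][A-Za-z&&[^AEIOUYaeiouy]]*)([A-Za-z]+)\\b",
                "$2$1ay");
    }

    public static void main(String[] args) {
        System.out.println(translate("bypass"));          // ypassbay
        System.out.println(translate("yellow"));          // ellowyay
        System.out.println(translate("3423423 JONES!"));  // 3423423 ONESJay!
    }
}
```

Because both patterns are anchored with \\b and only accept letters, a token that mixes letters and digits is skipped entirely rather than partly translated.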
