I'm trying to build a program that reads an HTML file and translates its visible text to Pig Latin (keeping the same case, all line breaks and all apostrophes). I want it to leave anything inside HTML tags, numbers, punctuation and URLs untouched.
I think I'm getting close; I'm just looking for hints on which library methods to use and where to do the translation.
I realize the replaceAll call is wrong, since it deletes the tags instead of skipping them. I was hoping for something like replaceAll, an "ignoreAll" of sorts, that skips over the things I don't want translated.
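To show what I mean by "ignoreAll", here is a rough sketch of the direction I'm considering; I haven't tried it on real HTML. It assumes java.util.regex is an acceptable tool: one pattern lists the things to leave alone (tags, entities, URLs starting with http:// or https://) ahead of a word group, and Matcher.appendReplacement / appendTail copy everything that matches nothing (numbers, punctuation, line breaks) through unchanged. The class name PigLatinSketch, the methods translate and translateWord, and the case rules are my own guesses at the behaviour I want, not anything taken from a library.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and rules here are assumptions, not a finished design.
final class PigLatinSketch {

    // Alternatives are tried left to right, so tags, entities and URLs are
    // consumed whole before the word group can match inside them.
    private static final Pattern TOKEN = Pattern.compile(
          "<[^>]*>"                    // an HTML tag (naive: a '>' inside an attribute breaks it)
        + "|&#?\\w+;"                  // an entity such as &amp; or &#39;
        + "|https?://[^\\s<>\"]+"      // a URL (assumed to start with http:// or https://)
        + "|([A-Za-z][A-Za-z']*)");    // group 1: a word, apostrophes kept

    // Assumption: 'y' counts as a vowel everywhere except at the start of a word.
    private static boolean isVowel(String word, int i) {
        char c = Character.toLowerCase(word.charAt(i));
        return "aeiou".indexOf(c) >= 0 || (c == 'y' && i > 0);
    }

    static String translateWord(String word) {
        int cut = 0;
        while (cut < word.length() && !isVowel(word, cut)) {
            cut++;
        }
        // Move the leading consonants to the end; vowel-initial words get "way".
        String result = word.substring(cut) + word.substring(0, cut)
                + (cut == 0 ? "way" : "ay");

        // Guess at "keeping the same case": ALL CAPS stays all caps,
        // Capitalized stays capitalized, anything else is left as produced.
        if (word.length() > 1 && word.equals(word.toUpperCase(Locale.ROOT))) {
            return result.toUpperCase(Locale.ROOT);
        }
        if (Character.isUpperCase(word.charAt(0))) {
            String lower = result.toLowerCase(Locale.ROOT);
            return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
        }
        return result;
    }

    static String translate(String html) {
        Matcher m = TOKEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only group 1 (a word) is translated; tags, entities and URLs are re-emitted as is.
            String replacement = m.group(1) != null ? translateWord(m.group(1)) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("<sdhfusidgfhdsfiugdfhghds9fuighdsfigudsf>3423423 JONES!"));
        // prints <sdhfusidgfhdsfiugdfhghds9fuighdsfigudsf>3423423 ONESJAY!
    }
}
```

With this approach the sample line comes out with the "!" after the suffix (ONESJAY!), so I'd still need to decide where punctuation should end up. URLs without a scheme, such as www.example.com, would also be translated as ordinary words.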
Right now my program reads a test.html containing:
<sdhfusidgfhdsfiugdfhghds9fuighdsfigudsf>3423423 JONES!
and returns:
ONES!
3423423 Jay
I'd like it to return <sdhfusidgfhdsfiugdfhghds9fuighdsfigudsf>3423423 ONES!JAY
Here's what I have so far:
import java.io.*;
import java.util.Scanner;
import java.util.Formatter;

public class test {
    private test() {}

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new FileReader("test.html"));
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append(System.getProperty("line.separator"));
        }
        String nohtml = sb.toString().replaceAll("\\<.*?>", "");
        final String vowels = "aeiouAEIOUy";
        String beforVowel = "";
        int cut = 0;
        while (cut < nohtml.length() && !vowels.contains("" + nohtml.charAt(cut))) {
            beforVowel += nohtml.charAt(cut);
            cut++;
        }
        if (cut == 0) {
            cut = 1;
            nohtml += nohtml.charAt(0) + "w";
        }
        System.out.println(nohtml.substring(cut) + beforVowel + "ay");
    }
}
Thanks for any guidance.