
I am using jsoup to scrape a web page generated by a PHP script that reads from my database. The data includes navigational coordinates like this: 51°42’.41N 004° 54’.61W

The data displays correctly on the web page, but when I parse it with jsoup and insert the resulting strings into my app, they contain the replacement character U+FFFD (�) at certain points in the string, like this:

51�42�.41N 004� 54�.61W

I can remove those characters with:

.replaceAll("\uFFFD", "")

However this then results in this:

51 42 .41N 004 54 .61W

This isn't very desirable as these are navigational coordinates.

Is jsoup responsible for this, or is it simply that Android cannot display these characters?
Is it possible to 'catch' those characters before they are turned into �, so that I could map them to something similar that Android would display?

For example, the character displayed in the navigational coordinates is an ordinal indicator (º), which I could replace with a degree sign (°).
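A minimal sketch of how the � can arise before any `replaceAll` runs, assuming the page's bytes are really single-byte Latin-1 but are decoded as UTF-8. The class and method names are hypothetical and only the Java standard library is used. Once the decoder has emitted U+FFFD, the original character is gone, so there is nothing left to "catch" afterwards.

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Decodes the bytes as UTF-8; malformed input is replaced with U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the bytes as ISO-8859-1, where every byte maps to one character.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: the stored text is Latin-1, so the ordinal
        // indicator (U+00BA) is the single byte 0xBA.
        byte[] stored = {'5', '1', (byte) 0xBA, '4', '2'};

        String wrong = decodeAsUtf8(stored);
        String right = decodeAsLatin1(stored);

        // A lone 0xBA is not valid UTF-8, so it becomes U+FFFD.
        System.out.println(wrong.indexOf('\uFFFD') >= 0); // prints true

        // With the matching charset the ordinal survives and can be
        // swapped for a degree sign (U+00B0).
        String fixed = right.replace('\u00BA', '\u00B0');
        System.out.println(fixed.indexOf('\u00B0') >= 0); // prints true
    }
}
```

If this assumption holds, neither jsoup nor Android's rendering is at fault: the declared charset and the actual bytes simply disagree.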

Additional: Code I am using to collect the Data:

            // Guard first: tableRows must exist before anything is selected from it
            if (tableRows != null) {

                System.out.println("NtmLoadingTask is Running");

                // Get the number of Notices to Mariners
                Element ntmNumber = tableRows.select("td:eq(0)").last();
                int ntmInt = Integer.parseInt(ntmNumber.text());

                // Select each column once rather than on every iteration
                Elements titles = tableRows.select("td:eq(1)");
                Elements dates = tableRows.select("td:eq(2)");
                Elements contents = tableRows.select("td:eq(3)");

                for (int i = 0; i < ntmInt; i++) {
                    // Ntm titles and dates
                    arr_dataNtmTitles.add(titles.get(i).text() + "\n");
                    arr_dataNtmDates.add(dates.get(i).text() + "\n");

                    // Ntm content; stripping U+FFFD only hides the symptom
                    String ntmContent = contents.get(i).text().replaceAll("\uFFFD", "") + "\n";
                    arr_dataNtmContents.add(ntmContent);

                    System.out.println(ntmContent);
                }
            }

Update 1:

I have tried: .replaceAll("\u00BA", "\u00B0") with no success :(

Update 2:

I have gone back to the original Java code that I wrote to collect the data and insert it into the database, and I have used the following to replace the unwanted characters:

 content = Content.text().replaceAll("[º°]", "°") +"\n";

and verified that it is doing its job by doing this:

 content = Content.text().replaceAll("[º°]", "*") +"\n";

It is definitely working and replaces the ordinal indicator with what I thought Android would accept (a degree sign, °), but I am STILL getting this:

51�42�.41N  004� 54�.61W

Something else that may be important to finding a solution, and that I hadn't noticed before because I was concentrating on the ordinal indicator: I am also getting the � at various other places in the strings, like this:

NO. 41�� OF 2014 Dock Lock Works 1.�MARINERS ARE HEREBY ADVISED....
and

Mariners are hereby advised that the deployment of �fire wires' is.....

From this I can see that some are clearly meant to be a space (there should be two spaces after the 41) and some are meant to be an apostrophe. I could really use some help with this. I have tried cleaning out the bad characters both before inserting them into the database and after parsing them from the PHP page (where they appear as they should), to no avail. Is there something I'm missing? When I parse other pages with jsoup I don't get this problem, so I now think it has less to do with Android's inability to display the characters and more to do with how they are inserted into, or read out of, the database. It looks as if something is filtering against SQL injection by removing apostrophes and the like.
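A hedged guess at why spaces and apostrophes are affected too: if the text was originally typed in a Windows-1252 editor, "smart" quotes are the single bytes 0x91 and 0x92 and a non-breaking space is 0xA0, none of which is valid UTF-8 on its own. The sketch below uses hypothetical names and sample bytes, and assumes the runtime provides the `windows-1252` charset (it is not one of the charsets Java guarantees, though it is commonly available).

```java
import java.nio.charset.Charset;

class Cp1252Demo {
    // Decodes raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed sample: left quote (0x91), "fire", non-breaking
        // space (0xA0), right quote (0x92), as Windows-1252 bytes.
        byte[] raw = {(byte) 0x91, 'f', 'i', 'r', 'e', (byte) 0xA0, (byte) 0x92};

        // As UTF-8 the three high bytes are malformed and become U+FFFD.
        String asUtf8 = decode(raw, "UTF-8");
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0); // prints true

        // As Windows-1252 they decode to U+2018, U+00A0 and U+2019.
        String asCp1252 = decode(raw, "windows-1252");
        System.out.println(asCp1252.indexOf('\uFFFD') >= 0); // prints false
    }
}
```

Note that ISO-8859-1 maps 0x91 and 0x92 to invisible control characters rather than quotes, so if curly quotes are expected, Windows-1252 may be the closer match.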

PHP Script:

<?php

header('Content-Type: text/html; charset=utf-8');

$con=mysqli_connect("******","*******","*******","*******");
// Check connection
if (mysqli_connect_errno())
{
echo "Failed to connect to MySQL: " . mysqli_connect_error();
}

$result = mysqli_query($con,"SELECT * FROM **********");

echo "<table border='1' title='table1'>
<title>HTML Table With PHP</title>
<caption>*************</caption>
<tr>
<th>NTM ID</th>
<th>NTM TITLE</th>
<th>NTM DATE</th>
<th>NTM CONTENT</th>
</tr>";

while($row = mysqli_fetch_array($result))
{
echo "<tr>";
echo "<td>" . $row['ntmID'] . "</td>";
echo "<td>" . $row['ntmTitle'] . "</td>";
echo "<td>" . $row['ntmDate'] . "</td>";
echo "<td>" . $row['ntmContent'] . "</td>";
echo "</tr>";
}
echo "</table>";

mysqli_close($con);
?>
J4C3N-14
  • Post the code you're using to retrieve the string from jsoup. – Kevin Coppock Aug 28 '14 at 21:51
  • @kcoppock Hi, I have updated the question with the code.. – J4C3N-14 Aug 29 '14 at 08:04
  • It looks like the page isn't specifying the right character encoding. I suspect you'll find your answer here: http://stackoverflow.com/q/7703434/20938 – Alan Moore Aug 31 '14 at 07:15
  • @AlanMoore Thanks for looking at the question. I think it may be leading me in the right direction, but since I control both the MySQL database and the PHP script that returns the data, shouldn't I be able to control the output? I have read up a little further and have now added a header() call to the script specifying the charset (I wasn't before), BUT when I applied it the same � appeared on the web page, whereas before all the characters rendered normally. I have now included the PHP script in my question as well; would you take a look at it? Note: the MySQL database is also UTF-8. Thanks – J4C3N-14 Sep 01 '14 at 23:46

1 Answer


Changing the charset declared in my PHP header to ISO-8859-1 stopped the replacement characters from appearing:

header('Content-Type: text/html; charset=ISO-8859-1');
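This presumably works because the header now matches the bytes the script actually sends. For cases where the server cannot be changed, a client-side sketch along the same lines is to decode the response bytes with an explicitly chosen charset instead of relying on the declared one. The helper name and the sample bytes below are hypothetical. jsoup also offers a `Jsoup.parse(InputStream, String charsetName, String baseUri)` overload that takes the charset explicitly, though I have not tested it against this page.

```java
import java.nio.charset.Charset;

class ExplicitDecodeDemo {
    // Hypothetical helper: decode a response body with a charset chosen
    // by the caller, then map the ordinal indicator (U+00BA) to a
    // degree sign (U+00B0).
    static String decodeAndNormalise(byte[] body, String charsetName) {
        String text = new String(body, Charset.forName(charsetName));
        return text.replace('\u00BA', '\u00B0');
    }

    public static void main(String[] args) {
        // Assumed sample: "51", ordinal indicator as byte 0xBA, "42".
        byte[] body = {'5', '1', (byte) 0xBA, '4', '2'};

        String result = decodeAndNormalise(body, "ISO-8859-1");

        // The result contains a degree sign and no replacement character.
        System.out.println(result.indexOf('\u00B0') == 2);  // prints true
        System.out.println(result.indexOf('\uFFFD') == -1); // prints true
    }
}
```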
J4C3N-14