I am trying to get some data from the webpage https://www.thpa.gr/index.php/en/services-3/search-ek

Basically, you enter a container number (for example OOLU0198315) and the page returns whether the container has been unloaded, along with some other information. My problem is that, as far as I can tell, the result is rendered in an iframe (or by JavaScript), so the data is not present in the page's source.

For example, if you search for OOLU0198315, it returns the following data:

<tr bgcolor="#fafafa"> 
<td style="padding:7px">OOLU0198315</td>
<td style="padding:7px">781442-1</td>
<td style="padding:7px">ΦΟΡΤΩΣΗ</td>
<td style="padding:7px">Nov 24 2020 11:04:26:217AM</td>
<td style="padding:7px">Δεν εκδόθηκε τιμολόγιο</td></tr>

This markup doesn't contain any id or class that I could use to select the data by XPath or by id.

I tried to get the data based on a previous question, "How can I scrape a table that is created with JavaScript in c#",

but I couldn't apply the same solution. I tried both Selenium and HtmlAgilityPack, but I can't find an XPath that selects the data. Is there any other way to get this information?

My code so far with HtmlAgilityPack:

WebClient webClient = new WebClient();
string page = webClient.DownloadString("https://www.thpa.gr/index.php/en/services-3/search-ek");

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

List<List<string>> table = doc.DocumentNode.SelectSingleNode("/html/body/div/table/tbody/tr[2]")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();

And with Selenium:

using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://www.thpa.gr/index.php/en/services-3/search-ek");
    var containerInfo = driver.FindElementById("I dont have Id");
}
rippergr
  • Why don't you use the iframe's source URL directly? It is https://portal.thpa.gr/fnet5/track/index.php – coder_b Nov 25 '20 at 18:26
  • @coder_b How is this different from the initial page? The data in index.php is in the same format as on https://www.thpa.gr/index.php/en/services-3/search-ek – rippergr Nov 25 '20 at 19:10

1 Answer

All I was saying is that, rather than using the parent URL for data extraction, you can access the content you want through the iframe's source URL.

You could implement something like the following to extract the required data. The code may need some refactoring, but it gives you an idea of how to develop it further for your requirements:

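using System;
using System.IO;
using System.Net.Http;
using System.Text;
using HtmlAgilityPack;

internal class Program
{
    // Reuse one HttpClient instead of creating a new one per request.
    private static readonly HttpClient Http = new HttpClient();

    // Posts the container code to the iframe's source URL and parses the response.
    private static HtmlDocument LoadContent(string reference)
    {
        const string url = "https://portal.thpa.gr/fnet5/track/index.php";

        // Same form fields as the search form; the container code is URL-encoded.
        var body = new StringContent(
            $"d=1&containerCode={Uri.EscapeDataString(reference)}&go=1",
            Encoding.UTF8,
            "application/x-www-form-urlencoded");

        // Blocking on the async calls keeps this console example short.
        HttpResponseMessage response = Http.PostAsync(url, body).GetAwaiter().GetResult();
        response.EnsureSuccessStatusCode();

        using (Stream stream = response.Content.ReadAsStreamAsync().GetAwaiter().GetResult())
        {
            var doc = new HtmlDocument();
            doc.Load(stream);
            return doc;
        }
    }

    private static void Main(string[] args)
    {
        HtmlDocument doc = LoadContent("OOLU0198315");

        // The result cells have no id or class, so match them on their inline style.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode
            .SelectNodes("//td[@style='padding:7px']");
        if (nodes == null)
        {
            Console.WriteLine("No result cells found.");
            return;
        }

        foreach (HtmlNode item in nodes)
        {
            Console.WriteLine(item.InnerHtml);
        }

        Console.ReadKey();
    }
}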
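
If you want the cells grouped per table row rather than as one flat list, a minimal sketch could look like the one below. It assumes the response keeps the markup shown in the question (a tr whose td cells all carry style="padding:7px"); the RowParser class, the ParseRows method and the SampleHtml constant are hypothetical names for illustration, and the sample row is copied from the question so that the sketch runs without a network call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowParser
{
    // Sample row copied from the question (wrapped in a table element for completeness).
    public const string SampleHtml =
        "<table><tr bgcolor=\"#fafafa\">" +
        "<td style=\"padding:7px\">OOLU0198315</td>" +
        "<td style=\"padding:7px\">781442-1</td>" +
        "<td style=\"padding:7px\">ΦΟΡΤΩΣΗ</td>" +
        "<td style=\"padding:7px\">Nov 24 2020 11:04:26:217AM</td>" +
        "<td style=\"padding:7px\">Δεν εκδόθηκε τιμολόγιο</td>" +
        "</tr></table>";

    // Returns one string array per result row, one entry per cell.
    public static List<string[]> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: result rows are exactly the rows containing a styled cell.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//tr[td[@style='padding:7px']]");
        if (rows == null)
        {
            // SelectNodes returns null when nothing matches.
            return new List<string[]>();
        }

        return rows
            .Select(tr => tr.Elements("td")
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToArray())
            .ToList();
    }

    public static void Main()
    {
        foreach (string[] cells in ParseRows(SampleHtml))
        {
            Console.WriteLine(string.Join(" | ", cells));
        }
    }
}
```

To use it on live data, you would pass the HTML of the response from the POST request to ParseRows instead of SampleHtml. Matching on the inline style is brittle, so expect to adjust the XPath if the site changes its markup.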


coder_b