0

So i have this code, it strips the whole HTML source code to next cells in a columns. The problem is that the web page that I use to extract the HTML source code have some Polish letters like "ą","ś" and so on. Is there a way to paste the code with those Polish letters ? Now I'm getting some crazy squares with question marks and so on. Any tip ?

ps. I have this code thanks to @pizzettix https://stackoverflow.com/users/6254609/pizzettix

Sub audycje()
    
    Dim strona As Object
    Dim adres As String
    Dim wb As Workbook
    Dim a As Object
    Dim str_var As Variant
    
    Set wb = ThisWorkbook
    adres = InputBox("Podaj adres strony")
    If adres = "" Then
       MsgBox ("Nie podano strony do zaladowania")
    Exit Sub
    End If
    
    Set strona = CreateObject("htmlfile")   'Create HTMLFile Object
    With CreateObject("msxml2.xmlhttp")  'Get the WebPage Content
       .Open "GET", adres, False
       .send
       strona.Body.Innerhtml = .responseText
    End With
    
    'Split_with_delimiter_newline
    split_var = Split(strona.Body.Innerhtml, Chr(10))
    
    Application.ScreenUpdating = False
    
    For i = 0 To UBound(split_var, 1)
       Cells(2 + i, 2).Value2 = split_var(i)
    Next i
    
    Application.ScreenUpdating = True
    
    End Sub
  • What's an example input value and expected output? And why are you creating a late bound internet explorer object which is never used? – QHarr Jan 28 '21 at 19:43

3 Answers3

0

Option 1: You can use the "Get External data" function in Excel to import the html page.

One you get the data in the into the cells, you can use the following function to replace the odd characters or accented characters with the regular characters.

Below is a function I use to replace accented characters with regular characters:

Function StripAccent(thestring As String)
' Replaces accented characters with regular characters
  Dim A As String * 1
  Dim B As String * 1
  Dim i As Integer
  Const AccChars = "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåáçèéêëéìíîïðñòóôõöøùúûüýÿ"
  Const RegChars = "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaaceeeeeiiiidnoooooouuuuyy"
  For i = 1 To Len(AccChars)
    A = Mid(AccChars, i, 1)
    B = Mid(RegChars, i, 1)
    thestring = Replace(thestring, A, B)
    thestring = Application.WorksheetFunction.Trim(thestring)
  Next
  StripAccent = thestring
End Function 

Option 2: Another option is to import the document as "Unicode text". This should preserve the Polish characters.

For a test, I copied a Polish paragraph from a web page, and pasted it in an Excel spreadsheet cell using Paste Special >> Unicode Text, and it preserved the Polish characters.

Sammy
  • 877
  • 1
  • 10
  • 23
  • Friendly hint: If you don't want to overwrite the original string input, I'd pass the `thestring` argument by value, i.e. `Function StripAccent(ByVal thestring As String) As String`. First examples at SO (since 5/2012) refer to range conversions, e.g. [convert accented characters](https://stackoverflow.com/questions/10032322/how-to-call-a-macro-to-convert-accented-characters-to-regular-that-does-not-ap/10033042?r=SearchResults&s=12|0.0000#10033042) – T.M. Jan 28 '21 at 19:28
  • @Sammy I'm not familiar with `Functions`. Where exactly should I add the code you've posted ? – HasztagWojt Jan 29 '21 at 10:14
  • You put the function code in a vba module and use it like x = StripAccent(whatever) in your vba code. – Sammy Jan 29 '21 at 19:01
  • Do you see any place I can fix your Function` in ? The problem is as I have the whole HTML code as a `Variant` so the `Function` crushes ;/ – HasztagWojt Jan 31 '21 at 19:23
  • I modified my answer and added Option 2. Option 2 worked. – Sammy Jan 31 '21 at 23:19
  • @Sammy I'm trying to replicated your ideas, but still cant get the result ;/ Can i send you the file so you can look through it and fix it ? The most important for me is to make the whole process fully automated so the end user doesn't need to copy the HTML code itself, but through the `ImputBox` – HasztagWojt Feb 05 '21 at 10:49
0

For encoding problems add in the beginning (Function avilable in Office 2013 and above):

Mystring = WorksheetFunction.EncodeURL(Mystring)

see my original post at Extract content of div from Google Translate with VBA

If your Office version is before 2013 or if you need to distribute to user that might have older versions then use: How can I URL encode a string in Excel VBA?

Change your code like this:

Dim Mystring as string
For i = 0 To UBound(split_var, 1)
   Mystring= split_var(i)
   Mystring = WorksheetFunction.EncodeURL(Mystring)
   Cells(2 + i, 2).Value2 = Mystring
Next i
Noam Brand
  • 335
  • 3
  • 13
  • Where exactly should I add that line of code ? Before or after `split_var = Split(strona.Body.Innerhtml, Chr(10))` ? – HasztagWojt Jan 29 '21 at 09:22
  • I've added like this but it's not working :( `[...] MsgBox ("Nie podano strony do zaladowania") Exit Sub End If strInput = WorksheetFunction.EncodeURL(strInput) Set strona = CreateObject("htmlfile") 'Create HTMLFile Object` – HasztagWojt Jan 29 '21 at 15:52
  • I've got "Type mismatch" error on `split_var = WorksheetFunction.EncodeURL(split_var)` – HasztagWojt Jan 30 '21 at 10:54
  • changed the code above. Mystring is a string not an array that is why you get the "Type mismatch" error – Noam Brand Jan 30 '21 at 17:03
  • Thank you very much for the tips, I've changed the code as you advised, but still I'm not getting those special Polish letters ;/ – HasztagWojt Jan 31 '21 at 19:01
0

After a month of search I've finally got it ! The below code does the trick :)

Posting once more due to the moderator deleting my answer for no reason...

The below code does the exact job I WAS LOOKING FOR

Sub audycje()
    Dim strona As Object
    Dim adres As String
    Dim wb As Workbook
    Dim str_var As Variant
    Dim Mystring As String
    
    Set wb = ThisWorkbook
    adres = InputBox("Podaj adres strony")
    If adres = "" Then
       MsgBox ("Nie podano strony do zaladowania")
    Exit Sub
    End If
    
    Set strona = CreateObject("htmlfile")   'Create HTMLFile Object
    With CreateObject("msxml2.xmlhttp")  'Get the WebPage Content
       .Open "GET", adres, False
       .setRequestHeader "Content-Type", "text/plain;charset=UTF-8"
       .send
       strona.body.innerHTML = StrConv(.responseBody, vbUnicode)
    End With
    
    'Split_with_delimiter_newline
    split_var = Split(strona.body.innerHTML, Chr(10))
    
    Application.ScreenUpdating = False
    
    For i = 0 To UBound(split_var, 1)
    wb.Worksheets("Dane").Cells(2 + i, 2).Value2 = split_var(i)
    Next i
        
    Application.ScreenUpdating = True