2

Solution I'm looking for: Since IMPORTXML is failing to import the data, is there any way to import the data directly via script without needing to create formulas in the spreadsheet? I cannot create a custom formula because these formulas have limitations of use, as I will use a lot, the limit will always happen and it cannot happen.

Xpath is perfectly correct but sometimes this error mysteriously occurs, I cannot understand what may be happening.

=
IMPORTXML("https://int.soccerway.com"&
IMPORTXML(A1,
"//*[@class='container left']//*[@class='last-five']/a[1]/@href"),
"//*[@class='playerstats lineups table']//@href")

Error
Internal import error

enter image description here

Link to Spreadsheet:
https://docs.google.com/spreadsheets/d/1nA3NFKhrON8wJgBiX0XcyEQqk6OZVoB3VrGaROsNPjc/edit?usp=sharing

Digital Farmer
  • 1,705
  • 5
  • 17
  • 67

3 Answers3

4

To complete, you can shorten your first XPath with :

(//a[@title][2])[2]/@href

EDIT : Since this XPath fails sometimes, just stick with :

//div[@class='container right']/div[@class='last-five']/a[2]/@href

To select only players (not coaches) and substitutes (who played) you can use :

//div[@class="combined-lineups-container"]//a[@href[contains(.,"players")]][not(parent::p[@class="substitute substitute-out"] or count(ancestor::td/p)=1)]/@href

EDIT :

Here's a WORKBOOK which works with IMPORTXML or IMPORTFROMWEB addon (number of requests are limited with the free plan).

First sheet is with IMPORTXML (one liner). Formula :

=IMPORTXML("https://int.soccerway.com/"&IMPORTXML(C1;"//div[@class='container right']/div[@class='last-five']/a[2]/@href");"//div[@class='combined-lineups-container']//a[@href[contains(.,'players')]]/@href")

IXML

Second sheet is with IMPORTHTML (breakdown in 2 parts). XPath used (get the url, players urls, players who played the match urls) :

//div[@class="container right"]/div[@class="last-five"]/a[2]/@href
//div[@class="combined-lineups-container"]//a[@href[contains(.,"players")]]/@href
//div[@class="combined-lineups-container"]//a[@href[contains(.,"players")]][not(parent::p[@class="substitute substitute-out"] or count(ancestor::td/p)=1)]/@href

IHTML

Third sheet is with IMPORTHTML (one liner). Formula used :

=IMPORTFROMWEB("https://int.soccerway.com/"&IMPORTFROMWEB(C1;"//div[@class='container right']/div[@class='last-five']/a[2]/@href");"//div[@class='combined-lineups-container']//a[@href[contains(.,'players')]]/@href")

IHTML2

Alternative if it fails with IMPORTXML or IMPORTFROMWEB : IMPORTDATA + regexes.

IDT

To generate the second url (second url) with the starting url, use something like :

="https://int.soccerway.com"&REGEXEXTRACT(INDEX(QUERY(IMPORTDATA(A2);"select * WHERE Col1 ENDS WITH '>D</a>' or Col1 ENDS WITH '>W</a>' or Col1 ENDS WITH '>L</a>'");7;1);"href=""(.*?)""")

The QUERY could be optimized with "matches".

To get the players name (Players v1), use :

=ARRAYFORMULA(REGEXEXTRACT(QUERY(IMPORTDATA(B2);"select Col1 WHERE Col1 STARTS WITH '<a' and Col1 CONTAINS 'flag_16 left' and Col1 CONTAINS 'players'");"href=""(.*?)"""))

You can consult my sheet HERE.

  • Cells with blue background contain formulas (mostly ARRAYFORMULA)
  • Cells with yellow background : short way to get the data
  • Cells with pink background : alternative way to filter the players who really played the match (a bit complicated, could be optimized)

EDIT 2 : "Lineups" sheet has been added to the IMPORTDATA workbook. That's an example to extract lineups (22 players) for the last 3 matches of home and away team of the start url. Example with : Lugano vs. Basel - 1 July 2020.

IDT2

Sometimes, Soccerway doesn't have the lineups. In that case, "no lineups" is returned.

E.Wiest
  • 5,425
  • 2
  • 7
  • 12
  • 1
    Unfortunately the failure keeps happening. Link to Spreadsheet: https://docs.google.com/spreadsheets/d/1nA3NFKhrON8wJgBiX0XcyEQqk6OZVoB3VrGaROsNPjc/edit?usp=sharing – Digital Farmer Jun 30 '20 at 23:15
  • 3
    Hi. Post is edited with multiple solutions and 2 shared workbooks. – E.Wiest Jul 01 '20 at 05:24
  • 1
    Thank you very much for all the time you spent creating the answer and for that I sincerely apologize, but unfortunately the options mentioned ended up not corresponding to my need, the options that literally specify the "playerstats lineups table" field continue to fail some times, and the options with IMPORTDATA fail to faithfully filter the data that is inside "playerstats lineups table" I will try to change the approach of the question and look for an option to import the data without using formulas in the cells and try to collect directly via script. – Digital Farmer Jul 01 '20 at 14:50
  • 3
    Hi . I've updated the `IMPORTDATA` workbook with another sheet : "Lineups". I suppose you can optimize it a bit. It fetches the lineups (22 players) for the last 3 matches of home and away team of the start url. I'm not facing any filtering problems with this method (do you have examples ?). Also, no mysterious "disappearances" (unlike `IMPORTXML`) for now. – E.Wiest Jul 01 '20 at 17:58
  • As soon as I get home I will put her to tests of several games to confirm, thank you very much for everything, I am learning a lot with the information you are sending and explaining! – Digital Farmer Jul 01 '20 at 21:01
3

When using IMPORTXML and generally when trying to do web-scraping "mysterious" errors happens and there isn't anything that we could do to avoid them specially when the data source belongs to a third-party, we can only set contingency measures.

To do this you could use an algorithm exponential backoff, briefly this works this way:

Using a loop try to get the data, exit when the data is got. On each iteration include a delay that should be increased. You should decide if you will set limits for the number of iterations or if this should be done until the data is got.

Usually you should set a limit and some sort of alert so you could investigate what is happening.

By the other hand, Google Apps Script doesn't include good tools to parse HTML.

  • There is the XmlService but it only works for well formed XHTML.
  • While it could possible to use regular expressions to extract some text, this is hacky.

Related

Rubén
  • 34,714
  • 9
  • 70
  • 166
0

try:

=IFERROR(IMPORTXML("https://int.soccerway.com"&
 IMPORTXML("https://int.soccerway.com/matches/2020/06/28/austria/bundesliga/lask-linz/wolfsberger-athletik-club/3246469/",
 "//*[@class='container right']//*[@class='last-five']/a[2]/@href"),
 "//*[@class='playerstats lineups table']//@href"), 
 IMPORTXML("https://int.soccerway.com"&
 IMPORTXML("https://int.soccerway.com/matches/2020/06/28/austria/bundesliga/lask-linz/wolfsberger-athletik-club/3246469/",
 "//*[@class='container right']//*[@class='last-five']/a[2]/@href"),
 "//*[@class='playerstats lineups table']//@href"))
player0
  • 124,011
  • 12
  • 67
  • 124
  • 1
    Unfortunately the failure keeps happening. Link to Spreadsheet: https://docs.google.com/spreadsheets/d/1nA3NFKhrON8wJgBiX0XcyEQqk6OZVoB3VrGaROsNPjc/edit?usp=sharing – Digital Farmer Jun 30 '20 at 23:07