I would like to extract a JSON file compressed with lzip (.lz). I have tried untar(), unzip(), and the archive package, but none of them works.

download.file(url = "https://parltrack.org/dumps/ep_votes.json.lz",
              destfile = "ep_votes.json.lz",
              mode = "wb")
archive("ep_votes.json.lz")
# Error: archive.cpp:37 archive_read_open1(): Unrecognized archive format

untar("ep_votes.json.lz", exdir = ".")
# tar.exe: Error opening archive: Can't initialize filter; unable to run program "lzip -d -q"
# Warning message:
# In untar("ep_votes.json.lz", exdir = ".") :
#   ‘tar.exe -xf "ep_votes.json.lz" -C "."’ returned error code 1

unzip("ep_votes.json.lz", exdir = ".")

# Warning message:
# In unzip("ep_votes.json.lz", exdir = ".") :
#   error 1 in extracting from zip file

Here is the documentation about lzip: https://www.nongnu.org/lzip/lzip.html.

Extraction works fine with WinRAR, but I would like to do it directly in R.

Do you have any idea how to fix these errors, or is there another solution?

JMCrocs
  • I downloaded that file and tried to use the command-line `lunzip` on it, and it says effectively the same thing: `Decoder error at pos 149`. That appears to be a corrupted lz file. – r2evans Sep 13 '21 at 18:30
  • I added `mode = "wb"` and now the file is no longer reported as corrupted. I can extract it with WinRAR, but not with R yet. – JMCrocs Sep 13 '21 at 18:41
  • Good, you found the `mode=` problem (I hadn't found it yet :-). I can't get it to work using `archive_extract("ep_votes.json.lz", "ep_votes.json")` or with `archive_read(.., format="lzma")`; both fail with `Unrecognized archive format`. Not sure what's going on. In a pinch, if you have `lunzip` installed in the OS itself, you can extract it via something like `system("lunzip ep_votes.json.lz")`. – r2evans Sep 13 '21 at 18:59
  • As I could not figure out how to install lunzip on Windows, I guess I will, for now, extract it manually with WinRAR. Thanks for your help! – JMCrocs Sep 13 '21 at 23:06
  • https://github.com/r-lib/archive/issues/52 here is the solution! – JMCrocs Sep 14 '21 at 14:12
  • That seems to make sense, though I have [suggested](https://github.com/r-lib/archive/issues/52#issuecomment-919215810) that perhaps `format="lzma"` should default to this behavior. Nice! – r2evans Sep 14 '21 at 14:40
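
Following up on the comments above, here is a minimal base-R sketch of the two ideas discussed: checking that the download is really an lzip file (a text-mode download on Windows can corrupt it), and falling back to an external lzip program. Both function names are hypothetical helpers, not part of any package, and the fallback assumes an `lzip` executable is installed and on the PATH.

```r
# Hypothetical helper: TRUE if the file starts with the 4-byte lzip
# magic string "LZIP", FALSE otherwise (including files shorter than 4 bytes).
is_lzip_file <- function(path) {
  magic <- readBin(path, what = "raw", n = 4L)
  identical(magic, charToRaw("LZIP"))
}

# Hypothetical helper: decompress with the external lzip program.
# Assumption: an `lzip` executable is installed and on the PATH.
lzip_decompress <- function(path) {
  exe <- unname(Sys.which("lzip"))
  if (!nzchar(exe)) stop("lzip was not found on the PATH")
  # -d decompresses, -k keeps the original .lz file
  status <- system2(exe, c("-d", "-k", shQuote(path)))
  if (status != 0L) stop("lzip exited with status ", status)
  # lzip writes the output next to the input, without the .lz extension
  sub("\\.lz$", "", path)
}
```

If `is_lzip_file("ep_votes.json.lz")` returns FALSE, the download itself is the likely culprit, and re-downloading with `mode = "wb"` is worth trying before anything else.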

1 Answer

Jim Hester gave the answer on GitHub:

lzip is a compression format, not an archive format, e.g. it compresses only a single file, it does not store multiple files like a zip or tar archive would.

So you need to use archive::file_read() rather than archive().

e.g. data <- jsonlite::parse_json(archive::file_read("ep_votes.json.lz"), simplifyVector=FALSE)

Source: https://github.com/r-lib/archive/issues/52
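
Putting the pieces from the question and the answer together, the whole workflow might look like the sketch below. It assumes the archive and jsonlite packages are installed. `read_lz_json` is a hypothetical wrapper name, and its body is just the one-liner quoted above.

```r
# Hypothetical wrapper around the one-liner from the GitHub issue.
# Assumes the archive and jsonlite packages are installed.
read_lz_json <- function(path) {
  # file_read() returns a connection that decompresses while reading,
  # and parse_json() accepts a connection as input.
  jsonlite::parse_json(archive::file_read(path), simplifyVector = FALSE)
}

url  <- "https://parltrack.org/dumps/ep_votes.json.lz"
dest <- "ep_votes.json.lz"

# Only run the download and parse step in an interactive session.
if (interactive()) {
  # mode = "wb" keeps the download binary-safe on Windows.
  if (!file.exists(dest)) download.file(url, destfile = dest, mode = "wb")
  votes <- read_lz_json(dest)
}
```

The decompressed data never touches the disk here, so this reads the JSON into R but does not produce an `ep_votes.json` file.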

StayOnTarget
JMCrocs