
There is a Git repository on GitHub called platform_frameworks_base containing part of the Android source code.
I wrote an application that relies on all the .aidl files from that project, so it downloads them all on first start.
Until now, I did that by downloading the file Android.bp from the project root, extracting all file paths ending in .aidl from it, and then explicitly downloading them one by one.

For example, if I found this file path:

media/java/android/media/IAudioService.aidl

I knew I could download it like this:

wget https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/android-10.0.0_r47/media/java/android/media/IAudioService.aidl

This works fine up to Android 10 (git tag: android-10.0.0_r47).
Starting with Android 11 (e.g. git tag: android-11.0.0_r33), the file paths use wildcards instead of complete paths. See this Android.bp.

It now just contains wildcard/glob file paths like:

media/java/**/*.aidl
location/java/**/*.aidl

etc...

My current "solution":

  1. Clone the repo (only the last commit of the branch we care about):

    git clone --depth=1 -b android-11.0.0_r33 https://github.com/aosp-mirror/platform_frameworks_base.git

  2. Extract the wildcard/glob paths from Android.bp.

    grep '\.aidl"' Android.bp | cut -d'"' -f2

  3. Find all the files matching the wildcard/glob paths.

    e.g. shopt -s globstar && printf '%s\n' media/java/**/*.aidl
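For reference, steps 2 and 3 can be combined into one helper. This is only a sketch; it assumes the clone from step 1 already exists, and the function name list_aidl_files is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch combining steps 2 and 3: print every .aidl path matched by
# the glob patterns found in Android.bp of the given clone directory.
list_aidl_files() {
    local repo="$1"
    (
        cd "$repo" || exit 1
        shopt -s globstar nullglob   # enable ** and drop patterns with no match
        grep '\.aidl"' Android.bp | cut -d'"' -f2 |
            while IFS= read -r pattern; do
                # let the shell expand the glob against the working tree
                for file in $pattern; do
                    printf '%s\n' "$file"
                done
            done
    )
}

# Usage: list_aidl_files platform_frameworks_base
```

With nullglob set, patterns that match nothing expand to an empty list instead of being printed literally.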

But the download process takes way too long, because the repository contains over a gigabyte of binary files, even when I clone only the last commit of the branch I care about.

Now my actual question is either:
How can I just download the .aidl files that I actually care about? (Ideally without parsing the HTML of every folder in GitHub.)
Or
How can I download/clone the repository without all the binary files? (probably not possible with git?)

Edit:

I tried using the GitHub API to recursively go through all directories, but I immediately got an API rate limit exceeded error:

g_aidlFiles=""

# Recursively go through all directories and collect the paths of all
# found .aidl files in the global g_aidlFiles variable
GetAidlFilesFromGithub() {
    local l_dirUrl="${1-}"
    if [ "$l_dirUrl" == "" ]; then
        echo "ERROR: Directory URL not provided in GetAidlFilesFromGithub"
        exit 1
    fi
    
    echo "l_dirUrl: ${l_dirUrl}"
    
    local l_rawRes l_statusCode l_resBody
    l_rawRes="$(curl -s -i "$l_dirUrl")"
    l_statusCode="$(echo "$l_rawRes" | grep HTTP | head -1 | cut -d' ' -f2)"
    l_resBody="$(echo "$l_rawRes" | sed '1,/^\s*$/d')"
    if [[ $l_statusCode == 4* ]] || [[ $l_statusCode == 5* ]]; then
        echo "ERROR: Request failed!"
        echo "Response status: $l_statusCode"
        echo "Response body:"
        echo "$l_resBody"
        exit 1
    fi
    
    if [ "$l_resBody" == "" ]; then
        echo "ERROR: response body is empty"
        exit 1
    fi
    
    # jq -r strips the JSON quotes from the printed paths
    local l_newAidlFiles
    l_newAidlFiles="$(echo "$l_resBody" | jq -r '.[] | select(.type=="file") | select(.path | endswith(".aidl")) | .path')"
    
    if [ "$l_newAidlFiles" != "" ]; then
        echo "l_newAidlFiles: ${l_newAidlFiles}"
        g_aidlFiles="${g_aidlFiles}"$'\n'"${l_newAidlFiles}"
    fi

    local l_subDirUrls
    l_subDirUrls="$(echo "$l_resBody" | jq -r '.[] | select(.type=="dir") | .url')"
    if [ "$l_subDirUrls" != "" ]; then
        # Read from a here-string instead of piping into the loop, and
        # call the function without a subshell, so the recursive calls
        # can actually update g_aidlFiles
        while IFS= read -r l_subDirUrl ; do
            GetAidlFilesFromGithub "$l_subDirUrl"
        done <<< "$l_subDirUrls"
    else
        echo "No subdirs found."
    fi
}

GetAidlFilesFromGithub "https://api.github.com/repos/aosp-mirror/platform_frameworks_base/contents?ref=android-11.0.0_r33"

From what I understand, all my users would have to create a GitHub account and an OAuth secret to raise the limit. That's definitely not an option for me; I want my application to be easy to use.

Forivin
  • You can use the github api to [get a list of files in the repository](https://stackoverflow.com/questions/25022016/get-all-file-names-from-a-github-repo-through-the-github-api). You could then apply your glob patterns to that list and then download only the ones you want, rather than the whole repository. – larsks Mar 12 '21 at 14:00
  • Does this require an API key? And if so do all my users have to get their own API keys or can my users simply use an API key that I embed into my application? – Forivin Mar 12 '21 at 14:09
  • Okay nevermind, no API key is required. – Forivin Mar 12 '21 at 14:20
  • Okay the API doesn't seem to help too much because of the rate limit... – Forivin Mar 12 '21 at 16:46
  • I was wondering if there was some clever way to only request tree objects, but [I guess not](https://stackoverflow.com/questions/57530209/can-git-fetch-pack-be-instructed-to-fetch-a-single-tree-object). – larsks Mar 12 '21 at 16:57

4 Answers


Since the repo's on GitHub, which supports partial-clone filters, the easiest approach is probably to use its filter support.

git clone --no-checkout --depth=1 --filter=blob:none \
        https://github.com/aosp-mirror/platform_frameworks_base
cd platform_frameworks_base
git reset -q -- \*.aidl
git checkout-index -a

which could probably be finessed quite a bit to get the files sent in a single pack instead of the one-at-a-time fetch that produces.

For instance, instead of blob:none say blob:limit=16384, which gets most of them up front.

To do this in your own code, without relying on a Git install, you'd need to implement the Git protocol. Here's the online intro with pointers to the actual Git docs. It's not hard: you send text lines back and forth until the server spits out the gobsmacking lot of data you wanted, then you pick through it. You don't need to use https; GitHub supports the plain git protocol. Try running that clone command with GIT_TRACE=1 GIT_PACKET_TRACE=1.

jthill
  • This still downloads a gigabyte of data for me or at least a couple of hundred. – Forivin Mar 22 '21 at 11:23
  • Huh? `du -sh .git .` after doing that with blob:limit=8192 shows 24M and 7.9M for me, with a 16384 limit the checkout-index does only eight single-file fetches and ends up at 33M and 7.9M. – jthill Mar 22 '21 at 14:58
  • If you've got a git version that won't do this for you it should reject the filter option entirely. Check whether you've included the options above, me I like the 16384 one because there's only eight .aidl's bigger than that to fetch. – jthill Mar 22 '21 at 15:09
  • I've tried both `blob:none` and `blob:limit=16384`. I'm using git `2.26.3`. And I've also tried it with and without those variables. I haven't gotten to the point where I could run `du -sh .git .` because I canceled the commands after 5-10 minutes, when I could clearly see that git was downloading hundreds of MB of traffic. – Forivin Mar 22 '21 at 17:07
  • Yup. Just built 2.26.3 and tried it, that's flat-out bugged. bisecting for the fix. – jthill Mar 22 '21 at 17:37
  • Fixed in 2.27.0, specifically 167a575 – jthill Mar 22 '21 at 17:56

Not sure if this is what you wanted:

#!/usr/bin/env bash

get_github_file_list(){
    local user=$1 repo=$2 branch=$3
    curl -s "https://api.github.com/repos/$user/$repo/git/trees/$branch?recursive=1"
}

get_github_file_list aosp-mirror platform_frameworks_base android-11.0.0_r33 |
    jq -r '.tree[].path | select(endswith(".aidl"))'
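To go from that path list to actual files, one Trees API request plus per-file raw downloads might look like this. This is only a sketch: select_aidl_paths and download_aidl are made-up names, and the loop issues one raw.githubusercontent.com request per file:

```shell
#!/usr/bin/env bash
# Sketch: select_aidl_paths filters a Git Trees API response (stdin)
# down to .aidl paths; download_aidl fetches each one from the raw
# content host. Both function names are made up for illustration.
select_aidl_paths() {
    jq -r '.tree[].path | select(endswith(".aidl"))'
}

download_aidl() {
    local branch="$1" outdir="$2" path
    curl -s "https://api.github.com/repos/aosp-mirror/platform_frameworks_base/git/trees/$branch?recursive=1" |
        select_aidl_paths |
        while IFS= read -r path; do
            mkdir -p "$outdir/$(dirname "$path")"
            curl -s -o "$outdir/$path" \
                "https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/$branch/$path"
        done
}

# Usage: download_aidl android-11.0.0_r33 aidl
```

This costs a single API request for the listing, so the unauthenticated rate limit is no longer a problem. Note that the Trees API sets a `truncated` flag in the response when the tree is too large to return in full.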
Philippe

You could use the GitHub API code search endpoint to get the paths, and then use your wget raw.githubusercontent.com method to download them:

apiurlbase='https://api.github.com/search/code?per_page=100&q=repo:aosp-mirror/platform_frameworks_base+extension:aidl'
dlurlbase='https://raw.githubusercontent.com/aosp-mirror/platform_frameworks_base/android-10.0.0_r47/'
apiurl1="$apiurlbase+path:/media/java/"
apiurl2="$apiurlbase+path:/location/java/"
for apiurl in "$apiurl1" "$apiurl2"; do
  page=1
  while paths=$(
    curl -s "$apiurl&page=$page" | grep '"path": ' | grep -o '[^"]\+\.aidl'
  ); do
    # do your stuff with the $paths
    page=$(($page + 1))
  done
done

Unfortunately, the GitHub API code search endpoint only searches the default branch (in this case, master), whereas you want the android-10.0.0_r47 tag. There could be files in android-10.0.0_r47 but not in master, and this code won't find or download those.

An alternative solution is to do a very minimal clone of each tag you're interested in, and then use git ls-tree to get the paths of each tag, e.g.,

for tag in 'android-10.0.0_r47' 'android-11.0.0_r33'; do
  git clone --branch "$tag" --depth=1 --bare --no-checkout \
    --filter=blob:limit=0 git@github.com:aosp-mirror/platform_frameworks_base.git
  # only a 1.8M download
  mv platform_frameworks_base.git "$tag"
  cd "$tag"
  paths=$(git ls-tree -r HEAD --name-only | grep '\.aidl$')
  # do your stuff with the paths
  cd ..
done

If this is for your own use, I probably wouldn't use either of these methods. I would just clone the entire huge repo once and then work with it locally, e.g.,

if [ -e platform_frameworks_base ]; then
  cd platform_frameworks_base
  # fetch instead of pull: the loop below leaves HEAD detached,
  # and git pull would fail on the next run
  git fetch --tags
else
  git clone git@github.com:aosp-mirror/platform_frameworks_base.git
  cd platform_frameworks_base
fi
tags=$(git tag | grep '^android')
for tag in $tags; do
  git checkout $tag
  paths=$(git ls-tree -r HEAD --name-only | grep '\.aidl$')
  # do your stuff with the paths
done
webb
  • The GitHub search wouldn't work because master and the branches are completely different. I need the exact files of a specific branch, potentially a different one for each user. Your minimal git clone downloaded more than a gigabyte of data for me - not 1.8M. Downloading the whole repository is not an option because I have hundreds of users that I don't want to force to unnecessarily download a gigabyte of data. – Forivin Mar 15 '21 at 10:32
  • "Your minimal git clone downloaded more than a gigabyte of data - not 1.8M for me". Interesting. What git version? Mine's `2.24.3`. – webb Mar 16 '21 at 17:48
  • My git version is 2.26.3. I did the clone with `https://` instead of `git@` because didn't want to add a key btw. – Forivin Mar 16 '21 at 19:50

Given the circumstances, I would maintain a text file that is automatically updated with the latest repo file tree before each commit.

The script should be easy to write and fast to run, since all of this happens locally. It can be run manually as a new step in your workflow, or be integrated into your test/CI automation.

Then you know what to do in your end-user application: download this file first, filter it against the patterns in Android.bp, then fetch the files you want via the GitHub raw content links.
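The update script described here might be as small as the following. This is a sketch: the function name update_aidl_list and the file name aidl-files.txt are assumptions, not something from the repository:

```shell
#!/usr/bin/env bash
# Sketch: regenerate the committed list of .aidl paths from the files
# tracked at HEAD and stage it, e.g. from a pre-commit hook or CI job.
# The function and file names are made up for illustration.
update_aidl_list() {
    git ls-tree -r HEAD --name-only | grep '\.aidl$' > aidl-files.txt
    git add aidl-files.txt
}
```

The end-user application then downloads only aidl-files.txt (a few hundred KB at most) instead of walking the tree via the API.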

alex
  • I had a bad experience with relying on CI systems for some other stuff that consumed a lot of CI run time. In the end it broke because GitLab added time limits and size restrictions on their CI. I expect GitHub Actions and other free CI systems will eventually do the same because they need to pay for the servers somehow. – Forivin Mar 20 '21 at 22:21