2

It's probably a stupid question but here's the thing. I was reading this question:

Storing 1 million phone numbers

and the accepted question was what I was thinking: using a trie. In the comments Matt Ball suggested:

I think storing the phone numbers as ASCII text and compressing is a very reasonable suggestion

Problem: how do I do that in Java? And ASCII text does stand for String?

Community
  • 1
  • 1
dierre
  • 7,140
  • 12
  • 75
  • 120

4 Answers4

3

For in-memory storage as indicated in the question:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(
    new GZIPOutputStream(baos), "US-ASCII");
for(String number : numbers){
    out.write(number);
    out.write('\n');
}
byte[] data = baos.toByteArray();

But as Pete remarked: this may be good for memory efficiency, but you can't really do anything with the data afterwards, so it's not really very useful.

Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
  • Thanks. This interview question was freking me out. – dierre May 09 '11 at 08:22
  • If the order doesn't matter, I would sort the numbers first as they will compress much better as each number will tend to be almost the same as the previous. – Peter Lawrey May 09 '11 at 15:33
1

Yes, ASCII means Strings in this case. You can store compressed data in Java using the java.util.zip.GZIPOutputStream.

Kaj
  • 10,862
  • 2
  • 33
  • 27
  • I know, but that was in relation to the context of the question. The numbers have the same value as in ASCII, unless we talk about Arabic numbers. – Kaj May 09 '11 at 08:29
  • 1
    ASCII implies one byte per character, Java Strings use two. And regular ASCII numbers *are* Arabic numbers... – Michael Borgwardt May 09 '11 at 08:33
  • Ok, I should have said Arabic-Indic, Extended Arabic-Indic or Eastern Arabic. Those numbers don't look as the ones that we use in the western world, and they are outside of the ASCII table. Yes, two bytes in memory for Java characters, but you won't write two bytes if you write numbers to a file unless you specify an odd encoding. – Kaj May 09 '11 at 09:02
1

In answer to an implied, but different question;

Q: You have 1 billion phones numbers and you need to send these over a low bandwidth connection. You only need to send whether the phone number is in the collection or not. (No other information required)

A: This is the general approach

  • First sort the list if its not sorted already.
  • From the lowest number find regions of continuous numbers. Send the start of the region and the phones which are taken. This can be stored a BitSet (1-bit per possible number) Send the phone number at the start and the BitSet whenever the gap is more than some threshold.
  • Write the stream to a compressed data set.
  • Test this to compare with a simple sending of all numbers.

You can use Strings in a sorted TreeMap. One million numbers is not very much and will use about 64 MB. I don't see the need for a more complex solution.

The latest version of Java can store ASCII text efficiently by using a byte[] instead of a char[] however, the overhead of your data structure is likely to be larger.

If you need to store a phone numbers as a key, you could store them with the assumption that large ranges will be continous. As such you could store them like

NavigableMap<String, PhoneDetails[]>

In this structure, the key would define the start of the range and you could have a phone details for each number. This could be not much bigger than the reference to the PhoneDetails (which is the minimum)

BTW: You can invent very efficient structures if you don't need access to the data. If you never access the data, don't keep it in memory, in fact you can just discard it as it won't ever be needed.


Alot depending on what you want to do with the data and why you have it in memory at all.

You can Use DeflatorOutputStream to a ByteArrayOutputStream, which will be very small, but not very useful.

I suggest using DeflatorOutputStream as its more light weight/faster/smaller than GZIPOutputStream.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • I suspect that if the question was asked at Google, it was "1 billion" rather than "1 million" for that exact reason. And I'm pretty sure that a TreeMap of Strings will have rather more overhead (TreeMap is not a memory-efficient data structure). – Michael Borgwardt May 09 '11 at 08:19
  • 1
    If the question were significantly different I would give a different answer. While TreeMap is not very efficient, Having to decompress all the data and parse every time you want to get a number out, would be inconvenient. A lot depends on what you want to do with the data. I doubt you would put 1 billion phone numbers into a single database. – Peter Lawrey May 09 '11 at 08:22
  • It is a question from a job interview at Google (apparently). – dierre May 09 '11 at 13:41
  • @dierre, Some interviewers prefer the theoretical answers and some prefer the real world experience answers. You have to judge which sort of answer they are looking for, or find a way to give both. Often the fastest/most efficient way to do something you don't really need to do is; don't do it all ;) – Peter Lawrey May 09 '11 at 13:46
-1

Java String are by default UTF-8 encoded, you have to change the encoding if you want to manipulate ASCII text.

krookedking
  • 2,203
  • 20
  • 21
  • -1: While the internal encoding is not specified by the Java spec, the API implies UTF-16, and pretty much all implementations use it internally as well. UTF-8 is only used for literals in class files. – Michael Borgwardt May 09 '11 at 08:24