Tuesday, January 12, 2016

Is String.hashcode() unique enough?

Given a set of unique Strings is their set of String.hashcode() values unique enough?
Well... it depends on what you define as enough.
In my case below it was enough. Read how I assessed it.

It's clear that different inputs might map to the same hashcode value (2^32 different options), but what are the chances for it to happen?
I have one million users, each user owns 50 private items. An item is identified by a UUID.
I had these two conflicting goals:
(I) Represent each item as an integer instead of a UUID
(II) Avoid collisions. Any pair of items owned by the same user should resolve to a different hashcode.

What is enough: I could live with up to 10 users, out of a million, experiencing a collision. Most of these 10 users will never notice the collision. I assume the system will have other bugs with higher probably than that.
We're all unique

Assessing uniqueness

One way to asses is computing the statistical probability for such an event . But I preferred a "proof" that any programmer could appreciate even those without good statistics skills. Therefore I coded a simulation that simply ties it in practice:

Download from Gist
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package collisions.test;

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UUIDToHashcodeUniquenessTestMain {

 private final static int num_of_users = 1000 * 1000;
 private final static int num_of_stacks = 50;

 public static void main(String[] args) {
  int collisions = 0;
  for (int i = 0; i < num_of_users; i++) {
   collisions += calcCollisionsForUser();
  }
  System.out.println("Had " + collisions + " collisions for " + num_of_users + " users");
 }

 private static int calcCollisionsForUser() {
  int collisions = 0;
  Set<Integer> uuidSet = new HashSet<Integer>(num_of_stacks * 2);
  for (int i = 0; i < num_of_stacks; i++) {
   String uuid = UUID.randomUUID().toString();
   Integer uuidHashcode = uuid.hashCode();
   if (uuidSet.contains(uuidHashcode)) {
    collisions++;
   }
   uuidSet.add(uuidHashcode);
  }
  return collisions;
 }
}

The program comes back saying that a collisions aren't really something to worry about:
Iteration 0: Had 0 collisions for 1000000 users
Iteration 1: Had 0 collisions for 1000000 users
Iteration 2: Had 0 collisions for 1000000 users
Iteration 3: Had 0 collisions for 1000000 users
Iteration 4: Had 0 collisions for 1000000 users




Thursday, October 31, 2013

A unit test to enforce max heap when running Android UT on the PC

2013's mobile devices come with 1-2GB RAM, yet Android still enforces a very small heap size of 24MB-64MB only (though it keeps increasing with time).
It's pretty easy to write an Android app that drains the heap. For example: Caching images w/o an LRU cache, reading whole files into memory instead of working with streams.

-- Your code will always use up as much memory as the system has (My spin on Parkinson's law).

I'm developing an Android app with a big UT suite that I run on Eclipse in the PC. I noticed that my default heap size is 256MB, huge compared to mobile, meaning my tests could pass on the PC, but still cause an OOME on an actual device.

So, I created MobileLikeSmallHeapDuringTestsEnforce, a new unit test to enforce a small heap size during Junit tests execution. Just make sure you throw it in to any test project you have and you're safe.

Created as a GitHubGist, you're welcome to make it better:

Tuesday, January 15, 2013

Quickest way for a one-off XML sort - OR - Learning to keep the heavytools in the shed

What's the quickest way/tool to sort a 1000 entries xml file?
Your requirements: The xml is parked on your desktop. You only need to sort it just once so you can manually examine it. Sort by the tag "relevance:score".

How would you go about it? Would you:
A) Craft a pipe stream of shell utils?
B) Use the heavy tools - a Java main() that with uses JDom?
C) Refresh the XSLT skills you never had?
D) Try your luck with a Python script?
E) search for an online XML editor tool?
F) Or, my pick at the bottom.

Example xml document to sort:
<feed>
    <entry>
        <title>Bibi</title>
        <score>0.21000001</score>
    </entry>
    <entry>
        <title>Lapid</title>
        <score>0.42000002</score>
    </entry>
    <entry>
        <title>Yechimovich</title>
        <score>0.235</score>
    </entry>
    <!--- 997 more entries -->
</feed>

My Pick: considered all the above but it sounded like a headache for a simple sort operation. So I've .... thrown the file at MS Excel, turns out it can digest it rather well, and then I sorted by the score column. Yes! Surprising. But crappy MS Excel did the job (the original schema had more nesting than the document in the example above).

Life-saver lesson: Spend time picking the right tool for the job, than on the job itself.

One click to sort

Wednesday, December 5, 2012

Specifing arrays size/capacity - my latest buggy code and a bestpractice

 
We're often required to specify size/capacity when allocating array like structures.

While you MUST specify size for Java's primitive arrays. With realizable arrays like ArrayList/Vector, specifying #capacity is too often a #premature-optimization, that you could avoid.




happened to introduce a bug to our code base, by initialized an ArrayList with a negative capacity value:

List<RetrievedDocument> docs =  new ArrayList<RetrievedDocument>(scoreDocs.length - resultsOffset);

Performing such profanity results in an #IAE being thrown (triggered by a NegativeArraySizeException thrown by the underlying primitive array structure.
public ArrayList(int capacity) {
firstIndex = lastIndex = 0;
try {
array = newElementArray(capacity);
} catch (NegativeArraySizeException e) {
throw new IllegalArgumentException();
}
}

Lesson for the next 50 years (or until I switch off from Java):

Whenever setting the size of an array (or anything sizable) to an unknown ahead value, spend a second to consider whether the value could be negative.
Then, if the desired resulting behavior is to create a zero-sized array instead, use this one-liner for that:
new ArrayList( Math.max(0, possiblyNegativeCapacity) );
Q: Do you know if other languages makes the use case above easier/less error prone?

Q: Any other related best practices and pitfalls you would like to share?