There is a file that contains 10G (1,000,000,000) integers; please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!

How big is an integer? Is this a 10G file that contains integers stored as text, or in binary format? – Will A Aug 26 '10 at 6:48

Is the count of integers known? – Abhinav Sarkar Aug 26 '10 at 7:06

I have updated my question, please review it. @Will A: the integers are as big as the computer can represent them. @abhin4v: yes, as I have updated my question, it's 10G (1,000,000,000). – didxga Aug 26 '10 at 7:11
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.

Now you count up in that histogram until you reach the bin that covers the midpoint of the values.

Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.

Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.

Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount - topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64-bit integers stored, you can use the same strategy in 4 passes instead.
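As an illustration of that multi-pass generalization, here is a hypothetical Python sketch (the names are mine, not from the answer) that consumes 16 bits per pass over the data: 2 passes for 32-bit and 4 passes for 64-bit non-negative values.

```python
def radix_median(read_numbers, bits=64, chunk=16):
    """Median (lower middle) of non-negative ints via repeated histogramming.

    read_numbers is a zero-argument callable returning a fresh iterator
    over the data, one call per pass.
    """
    prefix, prefix_bits = 0, 0   # top bits of the median found so far
    offset = 0                   # values known to lie below the current bucket
    rank = None                  # 1-based rank of the median
    while prefix_bits < bits:
        shift = bits - prefix_bits - chunk
        hist = [0] * (1 << chunk)
        count = 0
        for n in read_numbers():
            count += 1
            # Skip values whose already-decided top bits don't match.
            if prefix_bits and (n >> (bits - prefix_bits)) != prefix:
                continue
            hist[(n >> shift) & ((1 << chunk) - 1)] += 1
        if rank is None:
            rank = (count + 1) // 2   # set from the full count on pass one
        cum = 0
        for i, h in enumerate(hist):
            if offset + cum + h >= rank:
                prefix = (prefix << chunk) | i   # the median is in this bin
                break
            cum += h
        offset += cum
        prefix_bits += chunk
    return prefix
```

Like the Scala version, each pass is O(n) with a fixed 2^16-entry histogram, so memory stays constant regardless of file size.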

The complexities are really impressive. Kind of reminds me of the radix sort / bucket sort ideas. Aug 27 '10 at 9:36

+1. Works like a charm! I understood the concept of generating the histogram for top bits to get to the median bucket (sort of like counting sort), but what I couldn't understand is why ignoring the top bits while creating the histogram for bottom bits, and then counting up to the midpoint, gets us the right answer. Any explanation on that part would be much appreciated! – Kay Nov 12 '11 at 18:26

@Kay You don't ignore the top bits; you pick only those that have the same top bits as the central bin. So you're really creating a histogram for the whole number, but you don't need to represent the part that is in common. – Rex Kerr Nov 12 '11 at 18:30

@RexKerr Sorry to have misread the solution. Thanks for the clarification. – Kay Nov 12 '11 at 18:35

This method does not work for arrays with negative integers: arr = [-3, -2, -1, 0, 1, 2, 3] will return 3. – chishui Apr 17 '18 at 1:22
You can use the Medians of Medians algorithm.
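For reference, a small in-memory Python sketch of that idea (hypothetical helper names): quickselect with the median-of-medians pivot rule, which gives worst-case O(n) selection. For 40GB of data the partitions would have to be streamed from or spilled to disk rather than held in lists.

```python
def select(a, k):
    """Return the k-th smallest element (0-based) of list a in worst-case O(n)."""
    if len(a) <= 5:
        return sorted(a)[k]
    # Median of each group of 5, then recursively the median of those medians.
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    pivot = select(medians, len(medians) // 2)
    lo = [x for x in a if x < pivot]   # strictly below the pivot
    eq = [x for x in a if x == pivot]  # equal to the pivot
    hi = [x for x in a if x > pivot]   # strictly above the pivot
    if k < len(lo):
        return select(lo, k)
    if k < len(lo) + len(eq):
        return pivot
    return select(hi, k - len(lo) - len(eq))

def median(a):
    # Lower middle for even-length input.
    return select(a, (len(a) - 1) // 2)
```

The grouping into fives guarantees the pivot discards a constant fraction of the input at every level, which is what makes the total work linear.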

+1, the factor of 5 difference between 10G and 2G sounds like this is the expected answer. Aug 26 '10 at 22:42

@Ants Aasma, 10G integers is usually 40GB, which is 20x of 2GB of RAM. However, Median of Medians can still work. – grokus Aug 27 '10 at 1:46


It should be noted that this gives only an approximation of the median. Taking the list in the Wikipedia example and sorting them, the middle values are 49 and 50 (so you could define the median to be 49.5) and NOT 47 as highlighted in the example there. Aug 28 '10 at 21:25

@Andre Holzer, it is a selection algorithm that uses an approximation of the median to find, for example, the exact median in O(n). The example on Wikipedia is about how to find that approximation, the pivot for the pseudo code above it. – Ishtar Aug 30 '10 at 12:46
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the xth largest integer. You can discard all of these except for the xth largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
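A rough Python sketch of this scheme (hypothetical names; `x` stands in for however many integers fit in 2G). It also tracks how many copies of each pass's cutoff value were already counted, so runs of duplicates are not lost:

```python
import heapq

def median_by_passes(read_numbers, s, x):
    """read_numbers: zero-argument callable giving a fresh iterator per pass.
    s: total count of integers. x: how many values fit in memory."""
    rank = (s + 1) // 2   # 1-based rank of the median, counted from the top
    cutoff = None         # x1, x2, ... : smallest value counted so far
    dup_counted = 0       # copies of `cutoff` already counted in prior passes
    seen = 0              # values accounted for by previous passes
    while True:
        heap, skipped = [], 0   # min-heap of the x largest remaining values
        for n in read_numbers():
            if cutoff is not None:
                if n > cutoff:
                    continue          # already counted in an earlier pass
                if n == cutoff and skipped < dup_counted:
                    skipped += 1      # this copy of the cutoff was counted too
                    continue
            if len(heap) < x:
                heapq.heappush(heap, n)
            elif n > heap[0]:
                heapq.heapreplace(heap, n)
        if seen + len(heap) >= rank:
            # The median is inside this chunk (for even s this returns the
            # upper middle; averaging the two middles is omitted for brevity).
            return sorted(heap, reverse=True)[rank - seen - 1]
        seen += len(heap)
        new_cutoff = heap[0]
        copies = heap.count(new_cutoff)
        dup_counted = dup_counted + copies if new_cutoff == cutoff else copies
        cutoff = new_cutoff
```

Each pass is one read of the file with O(log x) work per value, and roughly S/(2x) passes reach the median.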

@ajduff574: as I have updated my question, there are 10G (1,000,000,000) integers – didxga Aug 26 '10 at 7:16

+1 Impressive. But I foresee a problem dealing with large amounts of repeated numbers. Imagine that you are halfway through a pass through the file and your 2G sorted memory array fills up. Throughout the second half of your pass, you do not encounter any numbers that enable you to evict elements from your 2G array, but you encounter a lot of numbers that are exactly equal to the smallest element in the array (x1). You reach the end, discard your list, begin the next pass, and realize that you don't know which of the x1's you have discarded previously and which you have not. – advait Aug 26 '10 at 7:29

A possible solution might be not to use x1 in this case and use x1+1, thereby discarding everything in your 2G array that is greater than or equal to x1+1. However, you might reach a point where your entire 2G array becomes homogeneous. What then? You can't discard any numbers! – advait Aug 26 '10 at 7:31

Good catch! I think you could also keep track of how many times the smallest number (x) has appeared in your array. If it appears n times, then on the next pass you find the largest numbers <= x, discarding the first n that you find that are equal to x. If it appears only once, this is the same as above. Even if the whole list is the same, this should still work. Aug 26 '10 at 7:38

This has to be repeated n/2k times in the worst case. So if you use a sorted list implementation that can do search, insert and delete in O(log n), the time complexity is O(n^2/k * log k), where k is the number of integers in the sorted list. That means if k comes close to n it tends to be O(n log n), and in the other case, if k is a lot smaller than n, it tends to be O(n^2). In the given case k is close to n/2, so it's nearly O(n log n). The Median of Medians algorithm is O(n), but it may need to save the partitions to disk, so I think this should probably be faster. +1 Aug 26 '10 at 14:09
Make a pass through the file and find the count of integers and the minimum and maximum integer values.

Take the midpoint of min and max, and get the count, min and max for the values on either side of the midpoint, by again reading through the file.

If a partition's count > half the total count, the median lies within that partition.

Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.

I'm sure this would work for an arbitrary number of partitions as well.
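A minimal Python sketch of this bisection on the value range (hypothetical helper; `read_numbers` returns a fresh iterator over the data on each call, i.e. one disk pass per iteration):

```python
def median_by_bisection(read_numbers, total):
    """Lower median via repeated halving of the candidate value range."""
    rank = (total + 1) // 2           # 1-based rank of the (lower) median
    lo = min(read_numbers())          # in practice count/min/max come from
    hi = max(read_numbers())          # a single first pass over the file
    left = 0                          # values known to lie below lo
    while lo < hi:
        mid = (lo + hi) // 2
        below = sum(1 for n in read_numbers() if lo <= n <= mid)
        if left + below >= rank:
            hi = mid                  # median lies in the lower partition
        else:
            left += below             # 'partitions to the left' bookkeeping
            lo = mid + 1              # median lies in the upper partition
    return lo
```

Each iteration halves the value range, so the number of passes is bounded by the bit width of the integers, not by the number of elements.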

Impressive; it seems to me this should take N*log(N) time and use O(1) memory (with an extremely low constant). Apr 21 '13 at 11:54

Very nice! @avl_sweden Note that it's not really N*log(N), because the log isn't on the number of elements in the array, it's on the number range! So basically, for 64-bit integers, it's N*log(2^64), aka 64*N :) – ZeDuS Oct 19 '13 at 18:33
• Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
• Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.

The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.

Given n = number of integers in the original file:
• Running time: O(n log n)
• Memory: O(1), adjustable
• Disk: O(n)
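A toy Python sketch of this approach (hypothetical names, assuming a text file with one integer per line; small sorted runs merged with heapq.merge stand in for a real external mergesort):

```python
import heapq
import tempfile

def external_median(path, chunk_size=100_000):
    """Sort fixed-size chunks into temporary run files, then k-way merge
    them and stop at the middle to read off the median."""
    runs, total = [], 0
    with open(path) as f:
        while True:
            # zip(range, f) reads at most chunk_size lines per iteration
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            total += len(chunk)
            chunk.sort()
            run = tempfile.TemporaryFile("w+")
            run.writelines(f"{n}\n" for n in chunk)
            run.seek(0)
            runs.append(run)
    merged = heapq.merge(*(map(int, r) for r in runs))
    mid = (total - 1) // 2             # index of the lower middle element
    lower = upper = None
    for i, n in enumerate(merged):
        if i == mid:
            lower = n
        if i == mid + (total + 1) % 2:
            upper = n
            break                      # no need to merge past the middle
    return (lower + upper) / 2 if total % 2 == 0 else lower
```

Memory is bounded by `chunk_size` plus one buffered line per run, matching the "adjustable and unaffected by the number of integers" property above.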
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.
My best guess is that a probabilistic median of medians would be the fastest one. Recipe:

• Take the next set of N integers (N should be big enough, say 1000 or 10000 elements)
• Then calculate the median of these integers and assign it to variable X_new.

If the iteration is not the first, calculate the median of the two medians:

X_global = (X_global + X_new) / 2

When you see that X_global no longer fluctuates much, it means that you have found an approximate median of the data.

But there are some notes:

• The question arises whether the median error is acceptable or not.
• The integers must be distributed randomly in a uniform way for the solution to work.

EDIT: I've played a bit with this algorithm and changed the idea a bit: in each iteration we should sum X_new with a decreasing weight, such as:

X_global = k*X_global + (1.-k)*X_new,

with k from [0.5 .. 1.], increasing in each iteration.

The point is to make the calculation of the median converge quickly to some number within a very small number of iterations. So a very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations! Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));

    while (i < ARRAY_SIZE && k != 1.) {
        X_new = 0;
        for (int j = i; j < i + RANGE_SIZE; j++) {
            X_new += rand() % 10000 + 1;
        }
        X_new /= RANGE_SIZE;

        if (iter > 0) {
            k += dk;
            k = (k > 1.) ? 1. : k;
            X_global = k*X_global + (1.-k)*X_new;
        }
        else {
            X_global = X_new;
        }

        i += RANGE_SIZE + 1;
        iter++;
        printf("iter %d, median = %d \n", iter, X_global);
    }
    return 0;
}
Oops, seems I'm talking about the mean, not the median. If that is so, and you need exactly the median, not the mean, ignore my post. In any case mean and median are very related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and
 *            has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2)) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length + 1) / 2.0);

    String bitMask = "";
    int zeroBin = 0;

    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }

        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }

    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and I couldn't give an exact answer, so after the interview I went through some interview books and here is what I found in the Cracking the Coding Interview book.

Example: Numbers are randomly generated and stored in an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting—if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.
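The two-heap idea from the last bullet can be sketched in Python (hypothetical class name): Python's heapq is a min-heap, so the smaller half is stored negated to simulate a max-heap.

```python
import heapq

class RunningMedian:
    def __init__(self):
        self.low = []    # max-heap of the smaller half (values negated)
        self.high = []   # min-heap of the bigger half

    def add(self, n):
        # Push onto the small half, then move its largest element across
        # so that every element of `low` is <= every element of `high`.
        heapq.heappush(self.low, -n)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so the heap sizes differ by at most one.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]                   # odd count: low's root
        return (-self.low[0] + self.high[0]) / 2  # even: average of the roots
```

Each insertion is O(log n) and the median is read off the roots in O(1), which is why this structure suits a stream of randomly generated numbers.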
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.