Since there's talk about discrete and continuous probability, I'm gonna go off on a tangent here for those that are interested: (quite advanced material, so don't worry if it makes zero sense, I'm just trying to provide some foresight)
In "classical" probability, everything is either "mass"-focused (discrete) or "density"-focused (continuous). Let's review these two types of definitions.
Discrete: Assume we have a finite or countably infinite set of outcomes denoted by $S$. Then for each $x \in S$, we assign a mass $p(x)$ such that:
$p(x) \in [0, 1] \ \text{for all} \ x \in S$
and
$\sum_{x \in S} p(x) = 1$
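To make the discrete definition concrete, here's a quick sketch in Python (the fair-die pmf is just my illustrative choice, nothing special about it):

```python
from fractions import Fraction

# A hypothetical pmf on the finite outcome set S = {1, ..., 6}:
# a fair die assigns mass 1/6 to each outcome.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Axiom 1: every mass lies in [0, 1].
assert all(0 <= p <= 1 for p in pmf.values())

# Axiom 2: the masses sum to exactly 1.
assert sum(pmf.values()) == 1

# The probability of an event is the sum of the masses of its outcomes,
# e.g. P(roll is even) = 3 * (1/6).
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)
print(p_even)  # 1/2
```

Using `Fraction` keeps the arithmetic exact, so the "sums to 1" check is literal rather than up to floating-point error.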
Continuous: If the sample space of a random variable $X$ is the set of all reals, then the CDF is given by $F(x) = P(X \le x)$. $X$ is a continuous random variable with density $f(x)$ if $F(x)$ is an absolutely continuous function. Furthermore, $F(x)$ is a monotonically non-decreasing, right-continuous function with left limit 0 (as $x \to -\infty$) and right limit 1 (as $x \to +\infty$).
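As a quick sanity check of those CDF properties, here's a sketch using the exponential distribution with rate 1 (again just my choice of example):

```python
import math

# The exponential(1) distribution: CDF F(x) = 1 - exp(-x) for x >= 0,
# and 0 otherwise; its density is f(x) = exp(-x) on [0, inf).
def F(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

# Monotonically non-decreasing on a grid of points:
xs = [i / 10 for i in range(-20, 51)]
assert all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))

# Left limit 0 and right limit 1:
assert F(-100) == 0.0
assert abs(F(100) - 1.0) < 1e-12

# F is the integral of its density here: F(2) should match a numerical
# (midpoint-rule) integral of exp(-t) over [0, 2].
n = 100_000
integral = sum(math.exp(-(i + 0.5) * 2 / n) * (2 / n) for i in range(n))
print(abs(F(2) - integral) < 1e-6)  # True
```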
In fact we can go one step further and generalise both discrete and continuous distributions from a measure-theoretic point of view (using measure theory).
The main motivation behind this new treatment of probability is that not all distributions can be classified as discrete or continuous; they may be a mix of the two, or neither (the classic example is the Cantor distribution). Another motivation is that relatively simple concepts, such as independence, become confusingly hard to define when we deal with vectors of random variables. To illustrate:
Let $\mathbf{e}$ be a random $n \times 1$ vector and $\mathbf{b}$ be a random $k \times 1$ vector (where $n$ does not necessarily equal $k$); then how do we define independence between $\mathbf{e}$ and $\mathbf{b}$?
For example, if we take just two random variables $X$ and $Y$, then $X$ and $Y$ are independent iff $f(x, y) = f(x) f(y)$, where $f(x, y)$ is their joint pdf and $f(x)$, $f(y)$ are their marginal pdfs. If we adapt this to the random vector case, we have $f(\mathbf{e}, \mathbf{b}) = f(\mathbf{e}) f(\mathbf{b})$. Examining the RHS, we see that $f(\mathbf{e})$ is simply the joint distribution of the components of $\mathbf{e}$, and likewise, $f(\mathbf{b})$ is the joint distribution of the components of $\mathbf{b}$. But then what is $f(\mathbf{e}, \mathbf{b})$? How is it defined?
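For intuition, the factorisation criterion is at least easy to check numerically in the ordinary two-variable discrete case; the joint table below is made up purely for illustration:

```python
# A tiny numerical sketch of the factorisation criterion f(x, y) = f(x) f(y),
# using a discrete joint distribution (the numbers are illustrative only).
joint = {
    (0, 0): 0.12, (0, 1): 0.28,
    (1, 0): 0.18, (1, 1): 0.42,
}

# Marginals: sum the joint mass over the other coordinate.
fx = {x: sum(p for (a, b), p in joint.items() if a == x) for x in (0, 1)}
fy = {y: sum(p for (a, b), p in joint.items() if b == y) for y in (0, 1)}

# Independence: joint mass equals the product of marginals at every point.
independent = all(
    abs(joint[(x, y)] - fx[x] * fy[y]) < 1e-12 for (x, y) in joint
)
print(independent)  # True: this table factorises, fx = (0.4, 0.6), fy = (0.3, 0.7)
```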
Continuing on, for a more special case, consider when $\mathbf{e}$ is jointly multivariate normal, i.e., $\mathbf{e} \sim N(\boldsymbol{\mu}_e, \Sigma_e)$, and $\mathbf{b}$ is also jointly multivariate normal, i.e., $\mathbf{b} \sim N(\boldsymbol{\mu}_b, \Sigma_b)$. Now, in the special case of two jointly normally distributed variables $X$ and $Y$, a sufficient condition for independence is $\operatorname{cov}(X, Y) = 0$, where $\operatorname{cov}(\cdot)$ denotes the covariance between $X$ and $Y$. How then do we generalise that to the random vector case? We could use the "normal" definition of independence and say $\mathbf{e}$ and $\mathbf{b}$ are independent iff $\operatorname{cov}(\mathbf{e}, \mathbf{b}) = 0$. But then we would have to show $\mathbf{e}$ and $\mathbf{b}$ are jointly normally distributed... but this doesn't really make sense, since $\mathbf{e}$ and $\mathbf{b}$ are each already jointly normally distributed, so we would be talking about the 'joint' distribution of two joint distributions... how do you make sense of that?
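One concrete way to picture the vector case (the dimensions and matrices below are purely illustrative): stack $\mathbf{e}$ and $\mathbf{b}$ into one long vector, and then "covariance between the vectors" becomes the off-diagonal block of the big covariance matrix:

```python
import numpy as np

# Illustrative sketch: stack e (n = 2) and b (k = 3) into one 5-dimensional
# jointly normal vector with a chosen 5x5 covariance matrix. Independence of
# e and b then corresponds to the cross-covariance block cov(e, b) being zero.
n, k = 2, 3
Sigma = np.diag([1.0, 2.0, 1.0, 0.5, 3.0])  # block-diagonal: cross block is zero

cov_eb = Sigma[:n, n:]       # the n x k cross-covariance block cov(e, b)
print(np.all(cov_eb == 0))   # True -> e and b would be independent

# Introduce correlation between e_1 and b_1: the cross block is nonzero now.
Sigma2 = Sigma.copy()
Sigma2[0, n] = Sigma2[n, 0] = 0.7
print(np.all(Sigma2[:n, n:] == 0))  # False -> e and b would be dependent
```

The catch the paragraph above is pointing at: this block view only works once you agree that the *stacked* vector is itself multivariate normal, which is exactly the "joint distribution of two joint distributions" issue.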
Measure theory seeks to generalise probability theory; the main setup goes like this:
We first must define what exactly is meant by a "probability space". Intuitively, we need a sample space, then subsets of the sample space (i.e., events), and finally a function of some sort that assigns probabilities to these events. Note, this is intrinsically different from the random-variable world of probability, which follows a progression of event -> random variable -> outcome -> probability function -> probability. Measure theory goes back to the root (which is events) and follows a progression of sample space -> event -> probability function -> probability. Note: I'm being super informal here, but it's just to illustrate the basic intuition...
More formally, a probability space consists of three parts:
1. A sample space $\Omega$ (which is a set).
2. A set of events, denoted by $\mathcal{F}$.
3. Some function $P$ which assigns probabilities to events.
1. and 3. are perhaps "easy" to view and define; the problem resides in 2. In classical probability, 2. is characterised by random variables; in measure theory, we define something called a sigma-algebra ($\sigma$-algebra). I will simply provide the definition of a sigma-algebra; however, as you become more accustomed to measure theory, it will become very obvious why the axioms of a sigma-algebra are what they are. More precisely:
Let $\Omega$ be some set, and let $2^{\Omega}$ denote its power set (i.e., the set of all subsets of $\Omega$). Then a subset $\mathcal{F} \subseteq 2^{\Omega}$ is a sigma-algebra if it satisfies:
1. There is at least one set $A \in \mathcal{F}$ (i.e., $\mathcal{F}$ is non-empty).
2. If $A$ is in $\mathcal{F}$, then so is its complement $A^{c} = \Omega \setminus A$.
3. If $A_1, A_2, A_3, \ldots$ (note: infinitely many) are in $\mathcal{F}$, then so is $\bigcup_{i=1}^{\infty} A_i$ (note: again, an infinite union).
2. and 3. are often called closure under complementation and closure under countable unions (this bears resemblance to the definition of a subspace... for those that have done linear algebra).
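For a finite $\Omega$, the axioms can even be checked by brute force; here's a toy sketch (the function name and example sets are mine, and for finite $\Omega$ countable unions reduce to finite ones):

```python
from itertools import chain, combinations

# Verify the sigma-algebra axioms directly on a small finite Omega.
Omega = frozenset({1, 2, 3, 4})

def is_sigma_algebra(F, Omega):
    # Axiom 1: F is non-empty.
    if not F:
        return False
    # Axiom 2: closed under complementation.
    if any(frozenset(Omega - A) not in F for A in F):
        return False
    # Axiom 3: closed under (finite, since Omega is finite) unions.
    return all(frozenset(A | B) in F for A in F for B in F)

# The power set of Omega is always a sigma-algebra.
power_set = {frozenset(s) for s in chain.from_iterable(
    combinations(Omega, r) for r in range(len(Omega) + 1))}
print(is_sigma_algebra(power_set, Omega))  # True

# {emptyset, {1}, Omega} is not: it is missing the complement of {1}.
print(is_sigma_algebra({frozenset(), frozenset({1}), Omega}, Omega))  # False
```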
We could also define something even more neat called a Borel sigma-algebra; I won't bother listing the axioms, but for anyone that's interested:
http://en.wikipedia.org/wiki/Borel_algebra
Under the above definitions, we define the measure-theoretic probability of a set $X$ (from a sigma-algebra $\mathcal{F}$) by:
$P(X) = \int_{x \in X} \mu_F(dx)$
where we say that the integral is with respect to a measure $\mu_F$ "induced" by the sigma-algebra $\mathcal{F}$. For anyone that's interested, this is called Lebesgue integration:
http://en.wikipedia.org/wiki/Lebesgue_integration
Anyhow, it turns out that the above formulation provides a sort of "unification" of discrete and continuous probability distributions. It is much more flexible, and the treatment it provides allows us to define many things which were confusing (or not possible) under classical probability.
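To illustrate that unification very loosely (a numerical sketch, nothing rigorous, and the function names are mine): the same abstract recipe $P(X) = \int_{x \in X} \mu_F(dx)$ reduces to a plain sum for a purely atomic (discrete) measure, and to an ordinary density integral for an absolutely continuous one:

```python
import math

# Discrete case: the measure of a fair die puts mass 1/6 on each atom,
# so the "integral" over an event A is just a sum over the atoms in A.
atoms = {x: 1 / 6 for x in range(1, 7)}
def mu_discrete(A):
    return sum(p for x, p in atoms.items() if x in A)

print(mu_discrete({2, 4, 6}))  # ~0.5

# Continuous case: a measure with density exp(-x) on [0, inf); approximate
# the integral over [a, b] by a midpoint Riemann sum.
def mu_density(a, b, n=100_000):
    h = (b - a) / n
    return sum(math.exp(-(a + (i + 0.5) * h)) * h for i in range(n))

print(round(mu_density(0, 2), 4))  # 0.8647, i.e. ~ 1 - e^-2
```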