Data Mining Programming Assignment

For the individual project, each student will write a program for discovering approximate functional dependencies in a data set, representing an instance of a universal relation.

Conceptual guidance for the project can be found in the lectures of April 3 and April 5, as well as in this paper, particularly Section 4. The paper uses the AI term for functional dependencies, determinations, and some of its AI and machine learning terminology may be unfamiliar to you, but together with the lectures I think you can follow Section 4.

In particular, you will write and test a collection of functions that learn approximate functional dependencies from data. The functions are to be written in PYTHON 3, and the top level function is

find-approximate-functional-dependencies (DataFileName, depth-limit, minimum-support)

where DataFileName is the name of a CSV file in the program’s local directory, which is formatted as

A,   B,   C,  D,  E...
a2,  b1,  c3, d2, e6, ....
a3,  b1,  c2, d5, e1, ....
a1,  b4,  c1, d3, e1, ....
a2,  b1,  c3, d2, e6, ....
....

There are M columns and N+1 rows in this file.

The columns of this file correspond to the M different attributes, say A, B, C, … that describe the universal relation. The first row (the zeroth row) gives the M attribute names used to describe the data. The remaining N rows are the N records (or tuples) in the instance of the relation. The value in row i (i >= 1) of column j is the value of attribute j (as named in the first row) for tuple i.

You are provided with a helper function that converts the CSV file to a list of lists; call it in one of your functions to convert the data into an internal representation that your code can use. You can, of course, convert this list of lists into another data structure as well. A skeletal Python script (as a .txt file; save it as a .py file), along with the CSV helper function, is here. A very simple sample data set is here; it is provided only to check the read function. Remember that designing a data set with predictable FDs is part of your task; we will supply some data sets later. (UPDATE: see data here)

In a real BIG-BIG-BIG-data mining implementation, you wouldn’t read the entire data set into main memory at the outset, but we are not experimenting with BIG-BIG-BIG data.

depth-limit, referred to as k in both the lecture and Section 4.1 of the paper, is an integer that limits the depth of the search through the space of domains of functional dependencies.

minimum-support, referenced on the last slide of the April 3 lecture, is the threshold for identifying adequately approximate FDs (e.g., 0.95 in the lecture slide).
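One common way to score an approximate FD X -> A (check the exact definition against the April 3 slides; this is a sketch of one plausible reading) is to group the tuples by their X-values, keep the most frequent A-value within each group, and divide the number of kept tuples by N. With rows as a list of value lists and columns given by index:

```python
def fd_support(rows, domain_cols, target_col):
    # For each distinct combination of values in the domain columns,
    # count how often each target value co-occurs with it. The tuples
    # consistent with the best X -> A mapping are those carrying the
    # most frequent target value within their group.
    counts = {}  # domain-value tuple -> {target value: count}
    for row in rows:
        key = tuple(row[c] for c in domain_cols)
        group = counts.setdefault(key, {})
        group[row[target_col]] = group.get(row[target_col], 0) + 1
    kept = sum(max(group.values()) for group in counts.values())
    return kept / len(rows)
```

For example, with four tuples in which a1 pairs with b1 twice and b2 once, while a2 pairs with b3 once, the support of column 0 -> column 1 is 3/4 = 0.75.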

Your function, find-approximate-functional-dependencies, will return a list of tuples, where the tuple representing the FD A,B,C -> D (with support 0.98; see lecture) is ([A,B,C], D, 0.98). Note that the first element of a tuple is a list of the attributes in the domain of the FD, and the second element is a single attribute name. An example result of find-approximate-functional-dependencies (with depth-limit = 3, minimum-support = 0.90) is

[([A],C, 0.91), ([C, F],E, 0.97), ([A,B,C],D, 0.98), ([A, G, H],F, 0.92)]

Note that no domain has cardinality greater than 3 (because depth-limit = 3), and no FD has support less than 0.90.
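Because depth-limit bounds the cardinality of candidate domains, the search enumerates every non-empty attribute subset of size at most depth-limit. One library-free way to generate those candidates (a sketch; the name domains_up_to is illustrative) is:

```python
def domains_up_to(attrs, depth_limit):
    # Yield every non-empty subset of attrs with at most depth_limit
    # members, in a fixed order (each subset appears once, with
    # singletons emitted before their extensions).
    def extend(prefix, rest, k):
        for i, a in enumerate(rest):
            yield prefix + [a]
            if k > 1:
                yield from extend(prefix + [a], rest[i + 1:], k - 1)
    yield from extend([], list(attrs), depth_limit)
```

Pruning within this enumeration (for instance, not extending a domain that already supports a high-confidence FD) is one natural place for the efficiency enhancements mentioned below.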

An example result of find-approximate-functional-dependencies (with depth-limit = 2, minimum-support = 0.80) is

[([A],C, 0.91), ([C],A, 0.89), ([F],G, 0.81), ([B,C],D, 0.90), ([B, E],C, 0.85), ([D, F],B, 0.81)]

We (and you) will pretty print (pprint) your results and otherwise inspect them for correctness.

You will upload one .py (.txt) file with your Python 3 code to Brightspace. You may not use libraries (though this restriction may change shortly). If you

  • correctly implement find-approximate-functional-dependencies and all your auxiliary functions,
    • both on the data sets you will be given ahead of time and on those we use for grading;
  • nicely format and comment your code, with comprehensible and informative function header comments;

then you will receive an A- score (90%). If, in addition, you implement some efficiency enhancement, or some other addition found in the April 12 lecture, and explain it clearly in comments at the top of the submission file, then you can receive up to 100%.