Centre for Research in Mathematics and Data Science
 School of Computer, Data and Mathematical Sciences
Description
For this assignment, you will create a complete program that performs sentiment classification for movie reviews, from feature extraction through to classification. For a given review, your
program should predict whether it is positive, i.e. the reviewer liked the movie, or negative, i.e. the reviewer disliked the movie.
 About the data
You will use a large movie review dataset containing 25,000 movie reviews for training and 25,000 for testing. You can download the data from vUWS under the
assignments folder, named aclImdb.zip. You can also visit http://ai.stanford.edu/~amaas/data/sentiment/ for more information about the dataset, or download the data
directly from there.
Unzip the data to your local directory. In the aclImdb/ directory created by the zip file (~500MB), you will find the following three items, among others:
 train/: feature files and raw text files for the training set
 test/: feature files and raw text files for the testing set
 README: the readme file for more information on the dataset
Read the README file carefully for descriptions of the text files that contain the reviews and of their naming convention. The directories we are concerned with here are:
 ./aclImdb/train/pos: raw text files of positive reviews in the training set
 ./aclImdb/train/neg: raw text files of negative reviews in the training set
 ./aclImdb/test/pos: raw text files of positive reviews in the test set
 ./aclImdb/test/neg: raw text files of negative reviews in the test set
A full version of this dataset is available at hdfs://hadoop-01-149-21-172.scem.uws.edu.au:9000/users/ugbigdata/imdb/fullversion. A tiny cut-down version with far fewer
files (20 each for training and 10 each for testing) is also available at hdfs://hadoop-01-149-21-172.scem.uws.edu.au:9000/users/ugbigdata/imdb/tinyversion. The tiny version is for
experimentation purposes.
 Task 1. Feature extraction (15 points)
Use the map reduce model to convert all text data into matrices, and convert ratings to vectors. These will be used for classification in Task 2. Use TF-IDF to vectorise the text files; see
previous practical classes and lecture materials for TF-IDF. One step further, though, is to represent each text file (review) as a very long and sparse vector, as follows. Assume
wordlist is the final list of distinct words contained in all reviews and its length is D. Then each review will be a vector of length D, with each position associated with a word in
wordlist and the value being either 0, if the corresponding word is absent from the review, or the word's TF-IDF. For example, if wordlist = ['word1', 'word2', 'word3',
'word4'] and review 1 contains word1 and word4, then the vector representation of review 1 is [0.1, 0, 0, 0.4], assuming the TF-IDF of word1 and word4 in review 1 is 0.1 and
0.4 respectively. Note that TF is calculated from one single document while IDF is obtained from all documents in the collection.
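To make the representation concrete, below is a minimal pure-Python sketch (no Hadoop) of one common TF-IDF variant applied to a toy corpus of two tokenised reviews. The variable names and the particular TF and IDF formulas are illustrative assumptions only; any standard variant is acceptable.

    import math

    # Toy corpus: two already-tokenised "reviews".
    reviews = [
        ["word1", "word2", "word1", "word4"],   # review 0
        ["word2", "word3"],                     # review 1
    ]

    # Final list of distinct words across all reviews (length D).
    wordlist = sorted({w for review in reviews for w in review})

    # Document frequency: number of reviews containing each word.
    df = {w: sum(1 for review in reviews if w in review) for w in wordlist}
    n_docs = len(reviews)

    def tfidf_vector(review):
        """Return a dense vector of length len(wordlist) for one review."""
        vec = []
        for w in wordlist:
            if w in review:
                tf = review.count(w) / len(review)   # TF: from this single review
                idf = math.log(n_docs / df[w])       # IDF: from the whole collection
                vec.append(tf * idf)
            else:
                vec.append(0.0)                      # word absent from this review
        return vec

    for i, review in enumerate(reviews):
        print(i, tfidf_vector(review))

Note that with this particular IDF a word occurring in every review gets weight 0; smoothed variants avoid this, and, as stated in the notes below, any version of TF-IDF is acceptable.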
 Requirements:
1. The map reduce model is a must. Implement it using Hadoop streaming; a minimal mapper/reducer sketch is given after this list. All data are available on SCDMS HDFS. The recommendation is to get
the code working on the tiny version of the data first. You may then try your code on the full version; however, applying it to the full version is not required.
2. Generate two matrices, training_data and test_data, and two vectors, training_targets and test_targets. training_data should have N rows and D columns,
with each row corresponding to one review in the training set, where N is the total number of reviews in the training set and D is the total number of distinct words. N and D vary
depending on which version of the data you use. training_targets should have N elements, each of which is the rating of the corresponding review. test_data and
test_targets are defined similarly.
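As a starting point for requirement 1, here is a minimal Hadoop streaming sketch of the counting stage of TF-IDF: the mapper emits one (word, review file, 1) record per token, and the reducer aggregates these into per-review term counts together with each word's document frequency. The final TF-IDF multiplication (and the per-review length needed for TF) can then be done in a small follow-up step. The file names mapper.py and reducer.py, the tab-separated record layout, and the environment-variable lookup for the input file name are assumptions about one possible design, not a required one.

mapper.py:

    #!/usr/bin/env python3
    # Mapper: emit "word<TAB>review_file<TAB>1" for every token read from stdin.
    import os
    import re
    import sys

    # Hadoop streaming exposes the current input file through an environment
    # variable; its exact name differs between Hadoop versions (assumption).
    input_file = os.environ.get("mapreduce_map_input_file",
                                os.environ.get("map_input_file", "unknown"))
    doc_id = os.path.basename(input_file)

    for line in sys.stdin:
        for word in re.findall(r"[a-z']+", line.lower()):
            print(f"{word}\t{doc_id}\t1")

reducer.py:

    #!/usr/bin/env python3
    # Reducer: input arrives sorted by the first field (the word), so all
    # records for one word are contiguous; emit per-review counts plus the
    # word's document frequency as "word<TAB>review_file<TAB>count<TAB>df".
    import sys

    def flush(word, per_doc):
        df = len(per_doc)              # number of reviews containing the word
        for doc_id, count in per_doc.items():
            print(f"{word}\t{doc_id}\t{count}\t{df}")

    current_word, per_doc = None, {}
    for line in sys.stdin:
        word, doc_id, n = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                flush(current_word, per_doc)
            current_word, per_doc = word, {}
        per_doc[doc_id] = per_doc.get(doc_id, 0) + int(n)
    if current_word is not None:
        flush(current_word, per_doc)

A typical launch uses the standard streaming options, for example: hadoop jar <path to hadoop-streaming jar> -input <HDFS input dir> -output <HDFS output dir> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, where the jar location and HDFS paths depend on your installation.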
 Note:
Rating score extraction can be done in pure Python.
Using the map reduce model to extract TF-IDF is mandatory; if it is not used, a 50% penalty will apply to this task. There is no constraint on how the training and test matrices and
vectors are formed. There are many versions of TF-IDF, and there is no preference for which version you use.
You can use data frames (using the pandas package) instead of matrices and vectors to store the training and test data and targets, as sketched below.
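For example, assuming the reducer output above has been post-processed into tab-separated (word, review file, tfidf) triples in a local file, a minimal pandas sketch for assembling the data frame and targets could look like the following; the file name tfidf_train.txt and the ratings dictionary are illustrative placeholders.

    import pandas as pd

    # Assumed intermediate format, one line per (word, review) pair:
    #   word<TAB>review_file<TAB>tfidf
    records = pd.read_csv("tfidf_train.txt", sep="\t",
                          names=["word", "doc_id", "tfidf"])

    # Pivot into the review-by-word matrix: one row per review, one column
    # per distinct word, 0.0 where a word is absent from a review.
    training_data = records.pivot_table(index="doc_id", columns="word",
                                        values="tfidf", fill_value=0.0)

    # Targets: one rating per review, aligned with the rows of training_data.
    # ratings is assumed to be a dict {review file: rating} built from the
    # file names (see the rating-extraction sketch below).
    ratings = {"0_9.txt": 9, "1_4.txt": 4}          # illustrative placeholder
    training_targets = pd.Series(ratings).reindex(training_data.index)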
 Marking scheme for task 1:
Rating score extraction (3pts): parse the names of the text files to extract the ratings (a sketch is given after this marking scheme).
TF-IDF extraction (10pts): use the map reduce model to extract TF-IDF for each text file.
Forming matrices and target vectors (or data frames) (2pts): collect the TF-IDF values to form the training and test data for Task 2.
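For the rating-score part, the README describes review file names of the form <id>_<rating>.txt (for example, 200_8.txt is a review with rating 8), so a small pure-Python helper such as the following is enough; the function and directory names are illustrative.

    import os

    def extract_rating(filename):
        """Parse the rating out of a review file name such as '200_8.txt'."""
        base = os.path.basename(filename)
        return int(base.rsplit(".", 1)[0].split("_")[1])

    def ratings_for_directory(directory):
        """Return {file name: rating} for every review file in one directory."""
        return {name: extract_rating(name)
                for name in os.listdir(directory) if name.endswith(".txt")}

    # Example (path is illustrative):
    # train_pos_ratings = ratings_for_directory("./aclImdb/train/pos")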
