Centre for Research in Mathematics And Data Science
School of Computer, Data and Mathematical Sciences
Description
For this assignment, you will create a complete program that performs sentiment classification of movie reviews, from feature extraction through to classification. For a given review, your
program should predict whether it is positive (the reviewer likes the movie) or negative (the reviewer dislikes the movie).
About the data
You will use a large movie review dataset containing 25,000 movie reviews for training and 25,000 for testing. You can download the data from vUWS, in the
assignments folder, as aclImdb.zip. For more information about the dataset, or to download the data directly, visit http://ai.stanford.edu/~amaas/data/sentiment/.
Unzip the data to a local directory. In the aclImdb/ directory created from the zip file (~500MB), you will find the following three items, among others:
train/: feature files and raw text files for the training set
test/: feature files and raw text files for the testing set
README: the readme file for more information on the dataset
Read the README file carefully for the description of the text files that contain the reviews and their naming convention. The directories of interest here are
./aclImdb/train/pos: raw text files of positive reviews in the training set
./aclImdb/train/neg: raw text files of negative reviews in the training set
./aclImdb/test/pos: raw text files of positive reviews in the test set
./aclImdb/test/neg: raw text files of negative reviews in the test set
A full version of this dataset is available at hdfs://hadoop-01-149-21-172.scem.uws.edu.au:9000/users/ugbigdata/imdb/fullversion. A tiny cut-down version with far fewer
files (20 each for training and 10 each for test) is also available at hdfs://hadoop-01-149-21-172.scem.uws.edu.au:9000/users/ugbigdata/imdb/tinyversion. The tiny version is for
experimentation.
Task 1. Feature extraction (15 points)
Use the map reduce model to convert all text data into matrices, and convert the ratings to vectors. These will be used for classification in Task 2. Use TF-IDF to vectorise the text files; see
the previous practical classes and lecture materials for TF-IDF. One step further is to represent each text file (review) as a very long, sparse vector, as follows. Assume
wordlist is the final list of distinct words contained in all reviews and its length is M. Then each review is a vector of length M, with each position associated with a word in
wordlist and the value being either 0, if the corresponding word is absent from the review, or the word's TF-IDF otherwise. For example, if wordlist = ['word1', 'word2', 'word3',
'word4'] and review 1 contains word1 and word4, then the vector representation of review 1 is [0.1, 0, 0, 0.4], assuming the TF-IDF of word1 and word4 in review 1 is 0.1 and
0.4 respectively. Note that TF is calculated from a single document, while IDF is obtained from all documents in the collection.
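The vector representation described above can be illustrated with a minimal, pure-Python sketch. It is not the required map reduce implementation; it only shows the target output. It assumes one common TF-IDF variant (TF = count / document length, IDF = log(N / document frequency)); the function name tfidf_vectors and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (wordlist, vectors) for a list of tokenised documents."""
    # wordlist: every distinct word across all documents, in a fixed order.
    wordlist = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    # IDF is computed from the whole collection.
    idf = {w: math.log(n_docs / df[w]) for w in wordlist}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        # TF is computed per document; absent words have count 0, so value 0.
        vectors.append([counts[w] / length * idf[w] for w in wordlist])
    return wordlist, vectors

# Toy example: two tiny "reviews" that are already tokenised.
docs = [["word1", "word4", "word4"], ["word2", "word3", "word1"]]
wordlist, vectors = tfidf_vectors(docs)
```

With this particular variant, a word that occurs in every document (word1 here) gets an IDF of 0 and therefore a TF-IDF of 0. Other TF-IDF versions smooth the IDF to avoid this.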
Requirements:
1. The map reduce model is a must. Implement it using Hadoop streaming. All data are available on the SCDMS HDFS. It is recommended that you develop and debug your code on the tiny version of
the data. You may try your code on the full version, but applying it to the full version is not required.
2. Generate two matrices, training_data and test_data, and two vectors, training_targets and test_targets. training_data should have N rows and M columns,
with each row corresponding to one review in the training set, where N is the total number of reviews in the training set and M is the total number of words. N and M vary
depending on which version of the data you use. training_targets should have N elements, each of which is the rating of the corresponding review. test_data and
test_targets are similarly defined.
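As a rough illustration of the map reduce requirement, the sketch below shows one possible first stage: counting how often each word occurs in each review. It is a local simulation under stated assumptions, not a complete Hadoop streaming job. The function names map_review and reduce_counts, the word@doc_id key format, and the tokenising regular expression are all hypothetical choices. In an actual streaming job, the mapper and reducer would be separate scripts reading sys.stdin and printing to standard output, and the document identifier would have to come from the input file name.

```python
import re
from itertools import groupby

def map_review(text, doc_id):
    """Mapper step: emit one 'word@doc_id<TAB>1' line per token."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield f"{word}@{doc_id}\t1"

def reduce_counts(sorted_lines):
    """Reducer step: sum the counts for each 'word@doc_id' key.

    The reducer expects its input sorted by key, so lines with the
    same key are adjacent and can be summed in a single pass.
    """
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort -> reduce on a single review.
mapped = list(map_review("Great film, great cast.", "10_9.txt"))
reduced = list(reduce_counts(sorted(mapped)))
```

The per-review counts produced this way are only the raw term frequencies. Document frequencies, and from them IDF, would need one or more further map reduce passes over this output.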
Note:
Rating score extraction can be done in pure Python.
Using the map reduce model to extract TF-IDF is mandatory. If it is not used, a 50% penalty for this task will apply. There is no constraint on how you form the training and test matrices and
vectors. There are many versions of TF-IDF, and there is no preference for which version you use.
You can use data frames (from the pandas package) instead of matrices and vectors to store the training and test data and targets.
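Since the way the matrices are formed is left open, the following is one possible sketch in pure Python. It assumes the map reduce stage has already produced (doc_id, word, tfidf) triples; the function name build_matrix, the triple layout, and the sample values are hypothetical.

```python
def build_matrix(triples, doc_ids, wordlist=None):
    """Turn (doc_id, word, tfidf) triples into a row-per-review matrix."""
    if wordlist is None:
        # Derive the vocabulary from the triples themselves.
        wordlist = sorted({word for _, word, _ in triples})
    col = {word: j for j, word in enumerate(wordlist)}
    row = {doc_id: i for i, doc_id in enumerate(doc_ids)}
    # Start from all zeros: a word absent from a review keeps the value 0.
    matrix = [[0.0] * len(wordlist) for _ in doc_ids]
    for doc_id, word, value in triples:
        if word in col:  # skip words outside the supplied vocabulary
            matrix[row[doc_id]][col[word]] = value
    return wordlist, matrix

triples = [
    ("0_9.txt", "word1", 0.1),
    ("0_9.txt", "word4", 0.4),
    ("1_2.txt", "word2", 0.3),
]
wordlist, matrix = build_matrix(triples, ["0_9.txt", "1_2.txt"])
```

Passing the training wordlist as the third argument when building the test matrix keeps the columns of both matrices aligned. If a data frame is preferred, the result can be wrapped with pandas.DataFrame(matrix, index=doc_ids, columns=wordlist).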
Marking scheme for task 1:
Rating scores extraction (3pts): parse the name of text files to extract ratings.
TF-IDF extraction (10pts): use the map reduce model to extract TF-IDF for each text file.
Forming matrices and target vectors (or data frames) (2pts): collect TF-IDFs to form training and test data for task 2.
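The rating extraction step can be sketched as follows. This assumes the file naming convention is id_rating.txt (for example 200_8.txt), which should be checked against the README; the function name rating_from_filename and the sample paths are hypothetical.

```python
import os
import re

def rating_from_filename(path):
    """Return the integer rating encoded in a name like '200_8.txt'."""
    name = os.path.basename(path)
    match = re.fullmatch(r"(\d+)_(\d+)\.txt", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    # Group 1 is the review id, group 2 is the rating.
    return int(match.group(2))

paths = ["aclImdb/train/pos/200_8.txt", "aclImdb/train/neg/13_2.txt"]
training_targets = [rating_from_filename(p) for p in paths]
```

Keeping the list of paths in the same order as the rows of training_data ensures that each target lines up with its review.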