Archive for the ‘Uncategorized’ Category

Test 2 solutions posted, notes about the final

December 7, 2011

I posted the second exam and some solutions.  Feel free to look over them before the exam.  Remember that the exam will not cover Networks I or Natural Language Processing.  Don’t worry too much about the details you saw on HW 4, especially related to the HITS algorithm, appearing on the exam.

Homework 6 is out

December 7, 2011

Homework 6 has been posted.  It’s due via email by December 11th at 11:59 pm (this Sunday, end of day).  Try to give useful comments as it is appreciated and will result in you getting full credit.  You do not have to write an extremely detailed account if you don’t want to, but a one sentence response will not get you 50 points

In the lecture slides for today I grayed out the two subjects, Networks I and NLP, that you voted off of the final subject list.

Handing in your HW

November 30, 2011

The solution to problems 1 and 3 should be handed in ( printed out with cover sheet attached)  at the start of class on Thursday.  The .tar.gz or .zip of your files should be sent to my @cs.cmu.edu account with the subject “Homework 5 Handin”. I suggest putting in a README file in case something is wrong, and it would be in your best interest to describe at a high level what your code is doing so that I can better assign partial credit!

Clarifying comments on memory usage, and a hint!

November 30, 2011

I’ve gotten some questions about what is an appropriate amount of memory to use when writing your map reduce tasks.  My previous post guides you towards writing a solution that is more in the spirit of a streaming algorithm, and less in the spirit of a traditional algorithm in which data is stored in RAM.  For example, storing the whole social graph of Twitter, FB, etc. in RAM is not feasible.  However, that doesn’t mean you can’t use some reasonable amount of memory.  Even JBiebs and Lady Gaga have a number of followers and friends on Twitter that could be stored in mapper or the reducer.  In this case, it’d be OK to write a reduce task that takes in all of Lady Gaga’s followers and do something interesting with it (such as compute the average account age of her followers).

With that said, here’s a bit of a hint:  Suppose I gave you two files, one which had user IDs and another which is a directed graph, and I ask you to output the subgraph induced by those vertices (edges for which both endpoints are in the set).  We’ll do this in two steps: The first will filter out edges (u,v) for which u isn’t in the set.  Then we’ll filter out the rest of the edges.  We do so as follows:

Mapper 1:

If the line is “user U”, we output “U found” and “user U”, where U is some number.  If the line is “edge U V”, we output “U V”

Reducer 1:

For a given key, we are going to store all of the values that we see for that key (if it’s numeric), otherwise we’ll just echo  the line.  We can keep this collection of values in a set as it will be small relative to the amount of data in the file.  After we read in all of the (U, value) pairs for a given U, we see if ‘found’ is in set of values.  If so, we write out “edge U V” for every V != “found” in the set of values for the key U.  Again, if the reducer sees a line of the form “user U”, we simply output “user U”.

Think about what we’ve accomplished at this point, and how you can finish the task.  It’s ok to take a given line and do things such as repeat it, output it in different ways, pack it together with some other data to make a more complex key, etc.

Hopefully this leads you toward a solution in which you store a relatively small amount of data! :)

A note about memory requirements and efficiency

November 28, 2011

Something that a few of you have noticed is a comment in the word count reducer that says you wouldn’t actually want to do things the way that I have.  The reason is that you shouldn’t make assumptions about the amount of data any particular reduce (or map) task is going to see.  If you have to keep state for potentially every line that you see, that’s A Bad Thing.  In particular, you really want to keep the amount of memory that you use to a minimum.  It’s much better to write a two-phase mapreduce algorithm in which every mapper and reducer uses a bounded amount of memory over one which requires only a single pass but for which the amount of required memory could be proportional to the input size.

You should write programs that require a small amount of memory.  Most of the time if you are storing some data into a dictionary where keys are the MR keys and all you do is increment some small set of values, you’re doing it ‘wrong.’  Yes, what you wrote is correct, but it will fail if I run your code on a file with billions of distinct keys.  The moral of the story is to take advantage of the fact that you see all of the (key,value) pairs for a given key contiguously.

With regard to efficiency, we are most concerned about memory efficiency.  In the real-world you want to be conscious of how much intermediate data you are writing to disk- all of those mapper output (key,value) pairs get written somewhere, sent over the network, and sorted somewhere else.  Often times we think about the complexity of a MR algorithm as the number of phases required and how much intermediate data gets produced.

Finally, the way that we’re faking the MR framework potentially enforces stronger conditions on the (key,value) ordering.  If you do mapper -> sort -> reducer, the values are going also be ordered.  I’m not exactly sure of the behavior of the shuffle task in Hadoop, so just be careful about writing code which requires that the values for a given key appear in some particular order.

Homework 5 Sample Data

November 25, 2011

I put up some small sample data that’s simple enough you should be able to debug it by hand.  Check out the directory http://www.scienceoftheweb.org/15-396/assignments_f11/hwk5_files/ for the files.

small_graph1.txt – A small graph with non-consecutive vertex IDs.  The degree count file is small_graph1_degree_dist.txt.

The other graph has two complete, directed graphs with components {1,2,3,4} and {10,11,12}.  There’s a hashtag file in which one HT is used by a user not in the edge list, one HT is used by two users in different components, and the other two HTs are used by some subset of users within each component.  Make sure that you print out the empty edge list (“#HT edges “) if there aren’t any edges in the graph between any two users who used a particular HT.  This is illustrated in the output file.

 

Homework 5 is out

November 19, 2011

I just posted homework 5.  It involves programming in Python, so if you’ve never programmed in Python before PLEASE get in touch with me.  You have a bit less than two weeks to complete the assignment, and there’s hopefully an enticing bonus project for you to do.

Please check out the homework over the weekend and come to class on Tuesday with questions.  I haven’t yet posted test data for the Map-Reduce problem and will do so as soon as I prepare a nice data set for you to look at!  

tl;dr- Read through the assignment.  Start thinking about the problems.  Come to class with questions.

Homework 4 is out!

November 4, 2011

Homework 4 is out.  It should be pretty short and sweet and is due on November 15.  Take a look at it over the weekend and come to us with questions next week!  We should also have your tests graded by Tuesday.

All Materials Updated + Notes on Recommendation Systems

October 27, 2011

All of the lectures have been posted as well as solutions to HWs 2 and 3.

http://public.research.att.com/~volinsky/netflix/sigkddexp.pdf has a nice discussion about the Netflix challenge.  The ‘rematch’ was canceled over privacy concerns, but the first one was completed and the $1M reward was given out.

This paper details a method by which using both the Netflix Challenge dataset and reviews from IMDB they were able to figure out the identities of some of the users!

Finally, you can look at the distribution of certain movie rentals in various cities on this NYT interactive page.  As always, feel free to email the staff any questions that you have about the course content!

 

 

A Note on Web Crawlers

October 18, 2011

Here are a few links that touch on a few of the ideas or topics we discussed today:

Check out http://www.drunkmenworkhere.org/219 to see how different search engines crawl the ‘webspider tree of death’.

Webrings: http://dir.webring.org/rw  (http://www.webring.org/hub/drtvgw for example)

Directories from the past: http://dir.yahoo.com/

 

 

 

 

 

 

 

 

 

 


Follow

Get every new post delivered to your Inbox.