I scraped tvtropes.org. I looped through "https://tvtropes.org/pmwiki/pmwiki.php/AmericanFilms/*" to get all the movies.
Then I looped through those movies, looking for any links to tropes inside the page's ul lists. Something like this:

    from bs4 import BeautifulSoup

    def update_movie(suffix):
        # download() and cursor are helpers defined elsewhere in the scraper
        page = download("https://tvtropes.org" + suffix)
        movie = BeautifulSoup(page, "html.parser")
        tropes = set()
        for ul in movie.find_all("ul"):
            for a in ul.find_all("a", attrs={"class": "twikilink"}):
                # Keep only links into the /Main/ namespace (the trope pages)
                if "/Main/" in a["href"]:
                    tropes.add(a["href"])
        for trope in tropes:
            # Insert data into the table
            cursor.execute("INSERT INTO tropes (movie, trope) VALUES (?, ?)", (suffix, trope))

Some scraping tricks let me run these concurrently. tvtropes.org allows bots and is well-formatted, so this was pretty straightforward.
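For anyone who wants to try this, one simple way to run the per-movie work in parallel is a thread pool. This is only a sketch and not necessarily the trick used here: `scrape_all`, `fetch_tropes`, and `max_workers` are hypothetical names, and `fetch_tropes` is assumed to be a function that takes a movie suffix and returns the set of trope links for that page (the parsing half of `update_movie`, without the database write).

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(suffixes, fetch_tropes, max_workers=8):
    """Fetch tropes for many movies in parallel.

    fetch_tropes is a hypothetical callable: suffix -> set of trope links.
    Returns a list of (movie, trope) rows ready to insert in one thread.
    """
    suffixes = list(suffixes)  # we iterate twice, so materialize first
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        for suffix, tropes in zip(suffixes, pool.map(fetch_tropes, suffixes)):
            rows.extend((suffix, trope) for trope in sorted(tropes))
    return rows
```

Collecting the rows and writing them from a single thread (for example with `cursor.executemany`) keeps the database connection out of the worker threads, which avoids sharing one connection across threads.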
Now we can do some basic queries:
The ten movies with the most tropes:
The ten most common tropes are:
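Both rankings can be pulled straight from the `tropes (movie, trope)` table. The sketch below assumes the table lives in SQLite (the `?` placeholders in the insert suggest it, but that is a guess); the helper names `top_movies` and `top_tropes` are made up for illustration.

```python
import sqlite3

# Movies ranked by how many distinct tropes they link to
MOST_TROPES = """
    SELECT movie, COUNT(DISTINCT trope) AS n_tropes
    FROM tropes
    GROUP BY movie
    ORDER BY n_tropes DESC, movie
    LIMIT ?
"""

# Tropes ranked by how many distinct movies link to them
MOST_COMMON = """
    SELECT trope, COUNT(DISTINCT movie) AS n_movies
    FROM tropes
    GROUP BY trope
    ORDER BY n_movies DESC, trope
    LIMIT ?
"""

def top_movies(conn, limit=10):
    """Return up to `limit` (movie, trope_count) rows, highest count first."""
    return conn.execute(MOST_TROPES, (limit,)).fetchall()

def top_tropes(conn, limit=10):
    """Return up to `limit` (trope, movie_count) rows, highest count first."""
    return conn.execute(MOST_COMMON, (limit,)).fetchall()
```

`COUNT(DISTINCT ...)` guards against double counting if a page was scraped and inserted more than once.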
We also look at an H-Index, defined as follows: a movie's H-Index is the largest n such that the movie has at least n tropes, each of which appears in at least n movies.
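Here is a minimal sketch of that definition in Python, assuming the scraped data is available as `(movie, trope)` pairs like the rows in the `tropes` table. The function name `trope_h_index` is made up for illustration.

```python
from collections import Counter

def trope_h_index(rows):
    """Compute the H-Index of every movie.

    rows: iterable of (movie, trope) pairs.
    Returns a dict mapping each movie to the largest n such that the movie
    has at least n tropes that each appear in at least n movies.
    """
    pairs = set(rows)  # drop duplicate (movie, trope) rows
    movies_per_trope = Counter(trope for _, trope in pairs)

    # For each movie, list how widespread each of its tropes is
    counts_by_movie = {}
    for movie, trope in pairs:
        counts_by_movie.setdefault(movie, []).append(movies_per_trope[trope])

    result = {}
    for movie, counts in counts_by_movie.items():
        counts.sort(reverse=True)
        h = 0
        # Walk from the most common trope down; stop once the rank-th
        # trope appears in fewer than rank movies
        for rank, count in enumerate(counts, start=1):
            if count < rank:
                break
            h = rank
        result[movie] = h
    return result
```

For example, a movie whose tropes appear in 3, 2, and 1 movies has an H-Index of 2: two of its tropes appear in at least two movies, but it does not have three tropes that each appear in at least three.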
We think there is heavy bias towards films that are popular among tvtropes.org contributors. The highest-grossing films are all action movies, and mostly recent. A priori we might expect older films, or films of other genres, to score high as well.
There seems to be a meaningful difference between overall count and H-Index. We argue that H-Index should be the source of truth for "trope-i-est movie" because we think of a movie as more trope-y both when it has a lot of tropes AND when those tropes are very common.
The H-Index seems to have a homogenizing effect, with the top 8 movies being superhero movies from the last twenty years. This is probably because many superhero movies from the last twenty years use some of the same tropes. It's worth noting that the tropes in The Godfather are probably popular because they were in The Godfather. (To a lesser extent the same could be said for The Dark Knight.) An interesting second analysis might weight each trope by how often it had appeared in earlier movies.