Research Question

How does the plot description of a film, as analyzed using Natural Language Processing techniques, relate to its overall success within the film industry?

Hypothesis

Null: The plot description of successful movies is not significantly different from that of other movies.

Alternative: The plot description of successful movies is significantly different from that of other movies.

Introduction

The adage “beauty is in the eye of the beholder” eloquently captures the inherent challenge in objectively assessing the quality of artistic endeavors. This paper explores the possibility of using quantitative methods to correlate the uniqueness of a movie’s story with its industry success. We aim to find a statistically significant correlation between the metrics of a given movie plot and its acclaim in the industry. We theorize that movies with intriguing stories may possess special characteristics in their plots, which could, in turn, lead to greater industry success. Using Natural Language Processing, we created various metrics that measure the uniqueness, relevance, indicativeness, and positivity of movie plots. These variables will help us fit a regression model to test our hypothesis. The significance of this approach lies in its contribution to the field of cultural sociology by attempting to demystify the underlying reasons for success in cultural industries. Our hope is that this methodology can be applied to other cultural products and industries as well. We recognize that movie plots alone are not the driving force behind the success of cinematic productions, but we aim to uncover the potential connection between movie plots and success.

Our research is modeled after an academic paper by Holbrook and Addis (2008) in which they define a two-path model to measure industry success in motion pictures. Given the availability of reliable data, we decided to focus on the path that they define as industry success, using our own approach. As the authors point out, the difficulties in this research area arise from the conflict between empirical studies investigating the relationship between artistic quality and success and the prevailing qualitative theories explaining how quality influences the consumption patterns of cultural products (p. 88). In other words, this line of work challenges the prevailing assumption that cultural success is determined by ordinary people based on their collective assessment of artistic quality (pp. 88-89). Although examining whether consumers prioritize artistic quality over popularity and marketing falls outside the purview of our study, it would be intriguing to explore comparable methods for evaluating the relationship between film characteristics and the two routes to success delineated by Holbrook and Addis.

Literature Review

Our literature review includes papers that study the cultural influence of movies using a computational approach, in addition to papers that study cultural products more generally. While we did not follow all of the methodology in these papers to craft our own, their approaches served as inspiration that guided our methodology. The most influential of these were Askin and Mauskapf (2017), Holbrook and Addis (2008), and Hung and Guan (2020).

1. A Hybrid CNN-LSTM Model for Improving Accuracy of Movie Reviews Sentiment Analysis 1

Rehman, Malik, Raza, and Ali showed the potential of the Hybrid CNN-LSTM Model for extracting useful information from the IMDB movie review dataset. This model combines the benefits of a CNN and an LSTM, providing an effective solution for text analysis. It has three major layers: an embedding layer, convolution layers, and an LSTM layer. The embedding layer assigns random initial weights to words and learns their embeddings. The convolution layers convolve the input and reduce complexity with pooling layers. The LSTM layer uses its three gates and cell state to manage information flow and capture dependencies between word sequences. The Hybrid CNN-LSTM Model achieved superior accuracy compared to standalone CNN and LSTM models on two benchmark movie review datasets, demonstrating its effectiveness in sentiment analysis. For our project, we aim to explore the relationship between IndRec_index and movie plot descriptions. To do this, we can use neural models to process large numbers of plot descriptions and extract meaningful features, then use these features to predict IndRec_index, much as the Hybrid CNN-LSTM Model predicts sentiment from a given movie review. If we achieve high accuracy in our predictions, it will support the notion that there is a strong correlation between plot descriptions and movie success.

2. The cultural environment: measuring culture with big data 2

Bail examines numerous text classification techniques in order to make the most of text samples, assessing the advantages and drawbacks of each. For instance, unsupervised clustering requires converting every unique word in a text to a number, calculating a similarity measure, and grouping texts with a high degree of similarity. Nevertheless, the outcomes are heavily dependent on how investigators pick the number of probable clusters and on the mathematical distances employed by each algorithm. Other text classification models, such as Latent Dirichlet Allocation (LDA), use probabilistic models to detect latent themes or topics within a set of texts. However, this method assumes that the order of words within a text is not significant, which is not always accurate, and it assigns texts to distinct categories, disregarding the connections between topics. We can utilize text classification methods to divide our film scripts into categories based on textual similarity and uncover the common textual features within each group. By recognizing the benefits and constraints of each technique, we can choose the techniques most appropriate for our study without adversely influencing our research conclusions.
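As an illustration of the unsupervised clustering workflow Bail describes, the sketch below (our own simplified example, not code from the original paper) converts texts to bag-of-words vectors, computes cosine similarity, and groups texts that exceed a similarity threshold:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    # cosine similarity between two bag-of-words Counters
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_texts(texts, threshold=0.5):
    # greedily assign each text to the first group whose seed text is similar enough
    vectors = [Counter(t.lower().split()) for t in texts]
    groups = []  # each group is a list of indices; the first index is the seed
    for i, v in enumerate(vectors):
        for g in groups:
            if cosine_sim(vectors[g[0]], v) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups
```

The `threshold` parameter here plays the role Bail warns about: the resulting groups depend heavily on this investigator-chosen cutoff and on the distance measure, just as cluster counts and distances drive the algorithms he reviews.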

3. Recommendations without user preferences: a natural language processing approach 3

The authors show the possibility of creating an automated movie recommendation system without user preferences. They propose a “naïve word-space” approach based on content similarity, using structured information (metadata) in addition to a Natural Language Processing technique to generate similarity scores between films. The authors compare two test algorithms for analyzing plot summaries: a word-space vector similarity metric and a topic signature genre similarity metric. They then evaluate their algorithms against a ‘gold standard’ baseline and conclude that the topic signature genre similarity metric outperforms the word-space approach, but not the commercially available, human-reliant IMDb algorithm. Topic signatures are lists of terms weighted by how indicative they are of a specific topic or genre; the authors generate a topic signature for each IMDb genre category using a statistical method based on term frequency and inverse document frequency. This research helps us visualize how to use quantitative methods to generate topic similarity scores. We want to expand their work by using more sophisticated NLP libraries that have come out in recent years, along with newer literature on the subject.
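The term weighting behind such topic signatures can be sketched with a minimal tf-idf computation (our own illustrative version; the cited authors' exact weighting scheme may differ):

```python
import math
from collections import Counter

def tfidf(doc, corpus):
    """Weight each term in `doc` by term frequency times inverse document
    frequency over `corpus` (a list of token lists that includes `doc`).
    Terms frequent in this document but rare elsewhere score highest."""
    tf = Counter(doc)
    n_docs = len(corpus)
    return {
        term: (count / len(doc)) * math.log(n_docs / sum(term in d for d in corpus))
        for term, count in tf.items()
    }
```

A genre's topic signature would then be the highest-weighted terms aggregated over the plot summaries belonging to that genre.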

5. Art versus commerce in the movie industry: A Two-Path Model of Motion-Picture Success. 5

This paper proposes a novel way to measure the success of movies using two different approaches, one looking at industry recognition and the other at market performance. The authors argue that these two routes are independent of each other rather than correlated elements of a single notion of success. They analyze a sample of 190 movies from the year 2000 and create new variables for fitting a linear regression model. In their methodology, they perform a Principal Components Analysis (PCA) with varimax rotation on a set of variables related to movie ratings and popularity, resulting in two uncorrelated factors (CritPopEval and CritPopBuzz) that are later incorporated into the equations for Industry Recognition (IndRecog) and Market Performance (MktPerf). We use their methodology as inspiration for our own analysis, creating a two-tier set of dependent variables to measure both economic and industry success. Specifically, we use PCA on moderately correlated (0.5+) variables that measure recognition by the academy to create our own IndRec index, which serves as our dependent variable.

6. Winning box office with the right movie synopsis. 6

Hung and Guan measure the influence of the linguistic cues of a synopsis (a summary of the plot) on a movie’s financial performance in Winning box office with the right movie synopsis (p. 594). Adopting text analysis, factor analysis, and structural equation modeling, they show that the language of the synopsis plays an important role in predicting box office performance. They argue that consistency between movie genres and linguistic cues promotes box office revenue when those cues match audience expectations for the genre (p. 596). They found a statistically significant relationship between some linguistic cues and box office performance (p. 608). Inspired by their research, we further investigate the relationship between plot descriptions and a movie’s success using a different dataset and our own NLP-generated independent variables.

Data 7

Data Description

For our project we used the “Movie Scripts Corpus” dataset from Kaggle because it was the most complete dataset we found that combined relevant metadata with text data. This dataset speaks to our research question because it packages movie plot data, scripts, and metadata together, all of which are crucial for analyzing how movie plots relate to the various metrics found in the metadata. For example, the dataset allows us to relate the uniqueness of a movie’s plot to its box office success. The dataset contains data and metadata for over 2,800 movies crawled from publicly available sources including IMDb, Metacritic, The Internet Movie Script Database, Academy Awards screenplays, Reddit, and others. In total, the original dataset includes 25 variables, including synopsis, producers, cast, awards won, and more. Out of these, we will be using the 12 most relevant variables:

  1. imdbid: A unique identifier assigned to each movie or TV show by the Internet Movie Database (IMDb), which is used to reference and organize the content within the database.

  2. metascore: A numerical score, typically ranging from 0 to 100, that aggregates and averages critic reviews and ratings from various sources, providing an overall assessment of a movie’s critical reception.

  3. awards: A list or count of accolades, recognitions, and honors a movie or TV show has received from film festivals, industry organizations, or critics’ associations, including nominations and wins.

  4. plot: A brief summary or description of the main storyline or narrative arc of a movie or TV show, highlighting the key events, conflicts, and character developments that drive the story forward.

  5. keywords: A collection of significant words or phrases that represent the main themes, subjects, or elements of a movie or TV show, often used for search and discovery purposes or to categorize content.

  6. genres: A set of categories or classifications that describe the overall style, tone, and subject matter of a movie or TV show, such as action, comedy, drama, science fiction, or romance.

Methods

Dependent Variable

We use IndRec_index to measure how successful a movie is. IndRec_index combines three metrics: the number of Oscar awards, the number of other awards, and the movie’s metascore. Since these variables are moderately correlated, we use Principal Component Analysis (PCA) to create an index that incorporates the weights of each variable, retaining the number of components that explain 80% of the variance. The result is a single score that combines our three candidate dependent variables into one dependent variable for the subsequent regression analysis. Our IndRec_index borrows its name and the use of PCA from Holbrook and Addis (2008), but our construction differs from theirs.
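A minimal sketch of this index construction, using NumPy rather than our actual code (the function name and the assumption that the three metrics arrive as columns of a matrix are ours):

```python
import numpy as np

def indrec_index(X):
    """Combine correlated success metrics (e.g. Oscar wins, other awards,
    metascore, one per column of X) into a single index via PCA, keeping
    enough components to explain 80% of the variance and weighting them by
    their explained-variance ratios. Illustrative sketch only."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize columns
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()                 # explained-variance ratios
    k = np.searchsorted(np.cumsum(ratio), 0.80) + 1  # components covering 80%
    scores = Z @ eigvecs[:, :k]                     # component scores per movie
    return scores @ ratio[:k]                       # variance-weighted index
```

Note that eigenvector signs are arbitrary in PCA, so in practice the index may need its sign aligned so that higher values mean more recognition.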

Independent Variables:

We use three attributes - plot, keywords, and genre - to create our independent variables, which measure the uniqueness, indicativeness, relevance, and positiveness of each movie plot description. We create four independent variables in the following way:

  • Uniqueness: We suggest that the presence of a greater proportion of unique words within a plot description may signify a semantically richer narrative. In order to assess this, we apply text analysis methods, such as tokenizing and stemming, to identify the distinct words in a given text. We then determine the ratio of these unique words to the overall length of the plot description. The rationale behind this is that a varied vocabulary is essential for adequately conveying the different subplots of a story in a non-repetitive manner. As a result, a narrative with a restricted assortment of unique words will yield a lower uniqueness score. \(Uniqueness = \frac{m}{n}\), where m is the number of unique words and n is the total number of words in the plot description.

  • Indicativeness: We start with an assumption that a successful movie should contain informative and indicative keywords within its plot. If a movie’s keywords are commonly shared among multiple films, they are not informative or indicative. We evaluate the informativeness and indicativeness of a movie’s keywords by calculating the frequency of each keyword in our dataset and taking its reciprocal. We then average these scores over all keywords to obtain the movie’s overall indicativeness score. \(Indicativeness = \frac{1}{K}\sum_{i=1}^{K}\frac{1}{k_i}\), where \(k_i\) is the dataset frequency of the i-th keyword and K is the number of keywords.

  • Relevance: We theorize that successful movies should have a plot description that aligns with consumers’ expectations of a particular movie genre. To assess a movie’s relevance to its genre, we utilize word embedding techniques to transform the plot description into an embedding vector. We then measure the distance between the movie’s plot vector and the center of its genre. The closer the movie vector is to the genre’s center, the more relevant it is to the genre. We derive the relevance score by taking the reciprocal of the distance. \(Relevance = \frac{1}{D}\), where D is the distance between the movie’s plot vector and the center vector of its genre.

  • Positiveness: We hypothesize that successful movies contain a sufficient amount of positive sentiment in their plot description. Although some tragedy can add depth to a story, an excessive amount of negativity can detract from the viewing experience. We evaluate the sentiment of a movie’s plot by employing a sentiment analysis package to compute the sentiment score. We refer to this metric as the “positiveness score.” Subsequently, we investigate how the positiveness score of a movie may influence its success.
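The uniqueness, indicativeness, and relevance measures above can be sketched as follows (a simplified illustration; our actual pipeline used tokenization and stemming libraries, Word2Vec embeddings, and TextBlob, and the helper names here are our own):

```python
import re
import numpy as np

def uniqueness(plot):
    # ratio of distinct tokens to total tokens (stemming omitted here)
    tokens = re.findall(r"[a-z']+", plot.lower())
    return len(set(tokens)) / len(tokens)

def indicativeness(keywords, corpus_freq):
    # average reciprocal of each keyword's frequency across the dataset
    return sum(1.0 / corpus_freq[k] for k in keywords) / len(keywords)

def relevance(plot_vec, genre_center):
    # reciprocal of the distance between the plot embedding and its genre centroid
    return 1.0 / np.linalg.norm(np.asarray(plot_vec) - np.asarray(genre_center))
```

For example, a five-word plot with four distinct words scores a uniqueness of 0.8, and a keyword that appears in only one movie contributes a full 1.0 to the indicativeness average.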

Note: Building a Word2Vec model and running TextBlob posed a challenge due to R’s limitations, so we used Python to create all the necessary variables and imported them into our Rmd file. To ensure consistency, we standardized both the dependent and independent variables after their creation.

Model

We conducted a regression analysis to examine the relationship between our dependent variable and the independent variables. Linear regression assumes a linear relationship between the dependent variable and one or more independent variables. We used linear regression to predict the IndRec_index from Uniqueness, Indicativeness, Relevance, and Positiveness, and calculated the Mean Squared Error (MSE) between the predicted and actual IndRec_index; a lower MSE indicates a better fit. The resulting model yielded regression coefficients of -2.933, 0.006, 0.189, and 0.060 for Uniqueness, Indicativeness, Relevance, and Positiveness, respectively. The MSE for the model was approximately 0.7505, which we refer to as the observed MSE.

library(stargazer)
# Create separate instances of the linear regression models
model_uniqueness <- lm(IndRec_index ~ uniqueness, data = df_new)
model_indicativeness <- lm(IndRec_index ~ indicativeness, data = df_new)
model_relevance <- lm(IndRec_index ~ relevance, data = df_new)
model_polarity <- lm(IndRec_index ~ polarity, data = df_new)

# Use title caption from fig.cap
tit <- knitr::opts_current$get("fig.cap")

# Adding caption for html output
tit_html <- paste0('<span id="tab:',
                   knitr::opts_current$get("label"),
                   '">(#tab:',
                   knitr::opts_current$get("label"),
                   ')</span>',
                   tit)

# Create a stargazer table for the separate linear regression models
stargazer(model_uniqueness, model_indicativeness, model_relevance, model_polarity,
          label = paste0("tab:", knitr::opts_current$get("label")),
          title = ifelse(knitr::is_latex_output(), tit, tit_html),
          dep.var.caption = "DV: Industry Recognition Index",
          covariate.labels = c("Uniqueness", "Indicativeness", "Relevance", "Polarity"),
          column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
          notes.label = "Significance levels",
          type = ifelse(knitr::is_latex_output(),"latex","html"),
          header = FALSE
          )
Table: Separate Linear Regression Results

DV: Industry Recognition Index (IndRec_index)

                          Model 1     Model 2     Model 3     Model 4
                            (1)         (2)         (3)         (4)
Uniqueness               -2.933***
                         (0.128)
Indicativeness                        0.006
                                     (0.020)
Relevance                                         0.189***
                                                  (0.019)
Polarity                                                      0.060***
                                                              (0.020)
Constant                  1.750***    0.152***    0.142***    0.152***
                         (0.072)     (0.019)     (0.019)     (0.019)
Observations              2,531       2,531       2,531       2,531
R2                        0.172       0.00003     0.039       0.003
Adjusted R2               0.172      -0.0004      0.039       0.003
Residual Std. Error       0.869       0.955       0.937       0.954     (df = 2529)
F Statistic             525.729***    0.083     102.453***    8.781***  (df = 1; 2529)

Significance levels: *p<0.1; **p<0.05; ***p<0.01
# Fit the combined model on all four predictors and compute the observed MSE
model <- lm(IndRec_index ~ uniqueness + indicativeness + relevance + polarity,
            data = df_new)
y_pred <- predict(model, df_new)
observed_mse <- mean((df_new$IndRec_index - y_pred)^2)
observed_mse
## [1] 0.7504973

# `mses` holds the MSEs from the 1,000 permutation trials (see Experiment and Result)
print(mean(mses < observed_mse))
## [1] 0

Experiment and Result

We have updated our hypothesis based on the variables we created and the Mean Squared Error (MSE) from linear regression:

Null hypothesis: The MSE of the linear regression of IndRec_index on the four independent variables (Uniqueness, Indicativeness, Relevance, and Positiveness) is not significantly lower than the MSEs obtained under random permutations of IndRec_index.

Alternative hypothesis: The observed MSE is significantly lower than the MSEs obtained under random permutations of IndRec_index.

To test our hypothesis, we use a permutation test. We perform 1,000 trials where we shuffle the IndRec_index column, fit a linear regression model with the shuffled IndRec_index column and the four independent variables, and calculate the MSE. This gives us 1,000 different linear regression models with 1,000 different MSEs. We plot the distribution of the 1,000 MSEs, and add a red line to represent the observed MSE. If our null hypothesis is correct, the observed MSE should fall within the distribution. However, the red line is far to the left of the distribution, and we obtain a corresponding p-value of 0. Therefore, we reject the null hypothesis and favor the alternative hypothesis. This suggests that there is a statistically significant correlation between the IndRec_index and the four independent variables (Uniqueness, Indicativeness, Relevance, and Positiveness). Based on this result, we conclude that the plot description of successful movies is significantly different from that of other movies, supporting our original alternative hypothesis.
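The permutation procedure can be sketched as follows (a NumPy reimplementation of the idea for illustration, not our original code; the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_mse(X, y):
    # ordinary least squares fit; return the in-sample mean squared error
    A = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ beta) ** 2)

def permutation_pvalue(X, y, trials=1000):
    # p-value: share of label-shuffled fits whose MSE is at least as low
    # as the MSE observed on the unshuffled data
    observed = ols_mse(X, y)
    null_mses = np.array([ols_mse(X, rng.permutation(y)) for _ in range(trials)])
    return float(np.mean(null_mses <= observed))
```

Shuffling the dependent variable destroys any real association with the predictors, so the 1,000 shuffled MSEs approximate the null distribution; an observed MSE far below that distribution yields a p-value near zero.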

Discussion and Limitation

Based on the results presented above, it appears that the plot descriptions of successful movies differ significantly from those of other movies, indicating that successful movies possess distinctive characteristics in their plot descriptions. This finding is encouraging: it raises important business questions and suggests that the ability to predict a movie’s success from its plot description alone could have significant implications for the movie industry. However, we observed that indicativeness, relevance, and polarity all have small regression coefficients with respect to the IndRec_index. Furthermore, while uniqueness has a large coefficient in magnitude, its negative sign contradicts our initial hypothesis that a greater proportion of unique words in a plot description signals a more semantically rich narrative and therefore greater success. So while we find a correlation between plot descriptions and movie success, we are unable to determine which specific characteristics of plot descriptions drive that correlation, and therefore cannot offer the movie industry direction on how to improve plot descriptions. We intend to explore this further in future work.

Our optimism about this type of approach stems from the fact that the most sophisticated of our measures, relevance, has both a positive coefficient on Industry Recognition and statistical significance. Having said that, we are under no illusion that this measure alone can explain the variance in industry recognition, as its R-squared value is only 0.039. This suggests that a more comprehensive analysis of artistic quality as a determinant of industry success must look at measures that go beyond the story alone. Future research should seek to create new features using computational methods, similar to the measure of musicality created to analyze musical genres in Askin and Mauskapf (2017). We can only speculate about why the uniqueness score produced contradictory results. One possibility is that uniqueness is not a comparative measure: each plot’s uniqueness is computed relative to itself (closer to a repetition measure over the words within that plot) rather than relative to other plots, which could explain the inconclusive results.

Conclusion

In conclusion, this study aimed to explore the correlation between the plot description of a movie and its overall success in the film industry. Using Natural Language Processing and the “Movie Scripts Corpus” dataset, we created various metrics to measure the uniqueness, relevance, indicativeness, and positivity of movie plots. We used IndRec_index, a combination of the number of Oscar awards, the number of other awards, and the movie’s metascore, as a dependent variable to measure a movie’s success, and four independent variables to measure the uniqueness, indicativeness, relevance, and positivity of its plot. Our analysis supports our alternative hypothesis that the plot description of successful movies is significantly different from that of other movies.

Expanding this approach can contribute to cultural sociology by employing quantitative methods to uncover the underlying factors driving success in cultural industries, and it can also serve as a useful foundation for other interested researchers. We anticipate that our methodology could be used to analyze other cultural products, further enriching our understanding of success in diverse cultural domains. Nonetheless, we recognize that movie plots alone do not determine the success of a film, and additional research is essential to identify other contributing factors. These factors may help create a more holistic picture of a movie’s success. By expanding the scope of the analysis and refining the methodology, future research in this area could offer valuable insights for the film industry and contribute to creating high-quality cinematic productions.


  1. Rehman, A.U., Malik, A.K., Raza, B. et al. A Hybrid CNN-LSTM Model for Improving Accuracy of Movie Reviews Sentiment Analysis. Multimed Tools Appl 78, 26597–26613 (2019). https://doi.org/10.1007/s11042-019-07788-7.↩︎

  2. Bail, Christopher A. “The cultural environment: Measuring culture with big data.” Theory and Society 43 (2014): 465-482.↩︎

  3. Fleischman, M., & Hovy, E. (2003, January). Recommendations without user preferences: a natural language processing approach. In Proceedings of the 8th international conference on Intelligent user interfaces (pp. 242-244).↩︎

  4. Askin, N., & Mauskapf, M. (2017). What Makes Popular Culture Popular? Product Features and Optimal Differentiation in Music. American Sociological Review, 82, 910 - 944.↩︎

  5. Holbrook, M. B., & Addis, M. (2008). Art versus commerce in the movie industry: A Two-Path Model of Motion-Picture Success. Journal of Cultural Economics, 32(2), 87–107. https://doi.org/10.1007/s10824-007-9059-2↩︎

  6. Hung, Y.-C., & Guan, C. (2020). Winning box office with the right movie synopsis. European Journal of Marketing, ahead-of-print. https://doi.org/10.1108/EJM-01-2019-0096↩︎

  7. Gur, K. (n.d.). Movie Scripts Corpus. Kaggle.Com. Retrieved March 24, 2023, from https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus↩︎