How to Feed Mallet Data to Ldavis

Have you ever wanted to classify news, papers, or tweets based on their topics? Knowing how to do this can help you filter out irrelevant documents, and save time by reading only what you're interested in.

That's what text classification is for: it lets you train a model to recognize topics. Because the model learns from labeled data, this is supervised learning.

In real life, you might not have data labels for text classification. You can go through each document to label them, or hire somebody else to do it, but that's a lot of time and money, especially when you have more than 1000 data points.

Can you find the topics of your documents without training data? Yes, you can use topic modeling to do it.

What is topic modeling?

With topic modeling, you can cluster words for a set of documents. This is unsupervised learning, because it automatically groups words without a predefined list of labels.

If you feed the model data, it will give you different sets of words, and each set of words describes the topic.

(0,
 '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law" + 0.013*"trump" '
 '+ 0.011*"kill" + 0.011*"country" + 0.010*"attack" + 0.009*"state" + '
 '0.009*"immigration"')
(1,
 '0.020*"student" + 0.020*"work" + 0.019*"great" + 0.017*"learn" + '
 '0.017*"school" + 0.015*"talk" + 0.014*"support" + 0.012*"community" + '
 '0.010*"share" + 0.009*"event"')

When you look at the first set of words, you would guess the topic is military and politics. Looking at the second set of words, you might guess the topic is public events, or school.

This is quite useful. Your texts are automatically categorized, without the need to label them!
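If you want to work with topics like these programmatically, it helps to split each formatted string into words and weights. The sketch below is a minimal illustration using only the standard library; `parse_topic` and `TERM_PATTERN` are hypothetical helper names, and the input is a shortened copy of the first topic shown above.

```python
import re

# Matches one `weight*"word"` term in a formatted topic string.
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Return a list of (word, weight) pairs from a formatted topic string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# Shortened copy of the first topic shown above.
topic_0 = '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law"'

for word, weight in parse_topic(topic_0):
    print(f"{word:10s} {weight:.3f}")
```

The weights are the probabilities the model assigns to each word within the topic, so sorting or filtering on them is a quick way to pick out a topic's most characteristic words.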

Visualize topic modeling with pyLDAvis

Topic modeling is useful, but it's difficult to understand it just by looking at a combination of words and numbers like above.

One of the most effective ways to understand data is through visualization. Is there a way that we can visualize the results of LDA? Yes, we can with pyLDAvis.

PyLDAvis allows us to interpret the topics in a topic model like below:

[Figure: pyLDAvis]

Pretty cool, isn't it? Now we will learn how to use topic modeling and pyLDAvis to categorize tweets and visualize the results. We'll analyze a real Twitter dataset containing 6000 tweets.

Let's see what topics we can find.

How to start with pyLDAvis and how to use it

Install pyLDAvis with:

pip install pyldavis

The script used to preprocess the data, and the processed data itself, are linked from the original article. Download the processed data before continuing.

Moving on, let's import relevant libraries:

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from pprint import pprint
import spacy
import pickle
import re
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import pandas as pd

To follow along with the article, download the data, put it in your current directory, and run:

tweets = pd.read_csv('dp-export-8940.csv')
tweets = tweets.Tweets.values.tolist()
tweets = [t.split(',') for t in tweets]

How to use LDA Model

Topic modeling involves counting words and grouping similar word patterns to describe topics within the data. If the model knows the word frequency, and which words often appear in the same document, it will discover patterns that can group different words together.
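To make "word frequency" and "words that appear in the same document" concrete, here is a small standard-library sketch. It only counts; it is not how LDA itself is fitted. The three tokenized documents are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["game", "team", "win"],
    ["team", "win", "play"],
    ["law", "order", "state"],
]

# How often each word appears across all documents.
word_counts = Counter(word for doc in docs for word in doc)

# How many documents contain each unordered pair of distinct words.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

print(word_counts["team"])           # occurrences of "team"
print(pair_counts[("team", "win")])  # documents containing both words
print(pair_counts[("game", "law")])  # these two never share a document
```

Words such as "team" and "win" that keep showing up together are the kind of pattern a topic model picks up on, while "game" and "law" never co-occur and would tend to land in different topics.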

We start with converting a collection of words to a bag of words, which is a list of tuples (word_id, word_frequency). gensim.corpora.Dictionary is a great tool for this:

id2word = Dictionary(tweets)
corpus = [id2word.doc2bow(text) for text in tweets]
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 2), (6, 2), (7, 1), (8, 1),
  (9, 1), (10, 1), (11, 2), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2),
  (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1),
  (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1),
  ... ,
  (347, 1), (348, 1), (349, 2), (350, 1), (351, 1), (352, 1), (353, 1),
  (354, 1), (355, 1), (356, 1), (357, 1), (358, 1), (359, 1), (360, 1),
  (361, 1), (362, 2), (363, 1), (364, 4), (365, 1), (366, 1), (367, 3),
  (368, 1), (369, 8), (370, 1), (371, 1), (372, 1), (373, 4)]]

What do these tuples mean? Let's convert them into human readable format to understand:

[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]

[[("'d", 1),
  ('-', 1),
  ('absolutely', 1),
  ('aca', 3),
  ('act', 1),
  ('action', 2),
  ('add', 2),
  ('administrative', 1),
  ('affordable', 1),
  ('allow', 1),
  ('amazing', 1),
  ...
  ('way', 4),
  ('week', 1),
  ('well', 1),
  ('will', 3),
  ('wonder', 1),
  ('work', 8),
  ('world', 1),
  ('writing', 1),
  ('wrong', 1),
  ('year', 4)]]
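If the mapping from words to (word_id, word_frequency) tuples still feels opaque, the following pure-Python sketch reproduces the idea on a tiny example. It is not gensim's implementation, and gensim may assign ids in a different order; `build_vocabulary` and `to_bag_of_words` are hypothetical helpers, and the two tokenized tweets are made up.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())


# Hypothetical tokenized tweets, invented for illustration.
docs = [["work", "week", "work"], ["week", "year"]]
token_to_id = build_vocabulary(docs)

print(token_to_id)
print(to_bag_of_words(docs[0], token_to_id))
```

Tokens that are not in the vocabulary are skipped, and word order is discarded: only the counts survive, which is why the representation is called a bag of words.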

Now let's build an LDA topic model. We will use gensim.models.ldamodel.LdaModel for this:

lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=10,
                     random_state=0,
                     chunksize=100,
                     alpha='auto',
                     per_word_topics=True)

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

There seem to be some patterns here. The first topic may be politics, and the second topic may be sport, but the pattern is not clear.

[(0,
  '0.017*"go" + 0.013*"think" + 0.013*"know" + 0.010*"time" + 0.010*"people" + '
  '0.008*"good" + 0.008*"thing" + 0.007*"feel" + 0.007*"need" + 0.007*"get"'),
 (1,
  '0.020*"game" + 0.019*"play" + 0.019*"good" + 0.013*"win" + 0.012*"go" + '
  '0.010*"look" + 0.010*"great" + 0.010*"team" + 0.010*"time" + 0.009*"year"'),
 (2,
  '0.029*"video" + 0.026*"new" + 0.021*"like" + 0.020*"day" + 0.019*"today" + '
  '0.015*"check" + 0.014*"photo" + 0.009*"post" + 0.009*"morning" + '
  '0.009*"year"'),
 (3,
  '0.186*"more" + 0.058*"today" + 0.021*"pisce" + 0.016*"capricorn" + '
  '0.015*"cancer" + 0.015*"aquarius" + 0.013*"arie" + 0.008*"feel" + '
  '0.008*"gemini" + 0.006*"idea"'),
 (4,
  '0.017*"great" + 0.011*"new" + 0.010*"thank" + 0.010*"work" + 0.008*"good" + '
  '0.008*"look" + 0.007*"how" + 0.006*"learn" + 0.005*"need" + 0.005*"year"'),
 (5,
  '0.028*"thank" + 0.026*"love" + 0.017*"good" + 0.013*"day" + 0.010*"year" + '
  '0.010*"look" + 0.010*"happy" + 0.010*"great" + 0.010*"time" + 0.009*"go"')]

Let's use pyLDAvis to visualize the topics:

[Figure: pyLDAvis visualization]

An interactive version of this visualization is linked from the original article.

  • Each bubble represents a topic. The larger the bubble, the higher the percentage of tweets in the corpus that are about that topic.
  • Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, the blue bars of the most frequently used words will be displayed.
  • Red bars give the estimated number of times a given term was generated by a given topic. As you can see from the image below, the word 'go' appears about 22,000 times in the corpus, and about 10,000 of those uses fall within topic 1. The word with the longest red bar is the one used most by the tweets belonging to that topic.
[Figure: pyLDAvis visualization]
  • The further the bubbles are away from each other, the more different they are. For example, it is difficult to tell the difference between topics 1 and 2, which both seem to be about social life, but it is much easier to tell the difference between topics 1 and 3. We can tell that topic 3 is about politics.
[Figure: pyLDAvis visualization]
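The idea that distant bubbles are "more different" can be made concrete with a distance between topic-word distributions. As far as I know, pyLDAvis bases its default layout on the Jensen-Shannon divergence between topics and then projects the result onto two dimensions; the sketch below covers only the divergence, not the projection, and the three topic distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over a four-word vocabulary.
topic_a = [0.5, 0.5, 0.0, 0.0]
topic_b = [0.4, 0.6, 0.0, 0.0]
topic_c = [0.0, 0.0, 0.5, 0.5]

print(js_divergence(topic_a, topic_b))  # small: the topics share their words
print(js_divergence(topic_a, topic_c))  # large: the topics share no words
```

Topics that put their probability on the same words end up close together, which is why overlapping bubbles usually signal topics that are hard to tell apart.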

A good topic model will have big and non-overlapping bubbles scattered throughout the chart. As we can see from the graph, the bubbles are clustered within one place. Can we do better than this?

Yes, because luckily, there is a better model for topic modeling called LDA Mallet.

How to use LDA Mallet Model

Our model will be better if the words in a topic are similar, so we will use topic coherence to evaluate our model. Topic coherence evaluates a single topic by measuring the degree of semantic similarity between high scoring words in the topic. A good model will generate topics with high topic coherence scores.

coherence_model_lda = CoherenceModel(model=lda_model, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score: 0.3536443343685833
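To get a feel for what a coherence score rewards, here is a deliberately simplified sketch. It is not the c_v measure used above; it is a document co-occurrence score in the spirit of UMass coherence, averaged over word pairs. `cooccurrence_coherence` is a hypothetical helper, the documents are invented, and the function assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def cooccurrence_coherence(top_words, docs):
    """Mean of log((D(earlier, later) + 1) / D(earlier)) over pairs of top words.

    D(w) is the number of documents containing w, and D(w1, w2) the number
    containing both. Higher scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        scores.append(math.log((doc_count(earlier, later) + 1) / doc_count(earlier)))
    return sum(scores) / len(scores)


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["game", "team", "win"],
    ["team", "win", "play"],
    ["law", "order", "state"],
    ["law", "state", "ban"],
]

print(cooccurrence_coherence(["team", "win"], docs))  # words that co-occur
print(cooccurrence_coherence(["team", "law"], docs))  # words that never co-occur
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated words, which is the intuition behind using coherence to compare models.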

This is our baseline. We have just used Gensim's inbuilt version of the LDA algorithm, but there is an LDA model that provides better quality of topics called the LDA Mallet Model.

Let's see if we can do better with LDA Mallet.

# gensim.models.wrappers is available in gensim 3.x; it was removed in gensim 4.0
mallet_path = 'path/to/mallet-2.0.8/bin/mallet'  # update this to your local Mallet install
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

pprint(ldamallet.show_topics(formatted=False))

coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=tweets, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score: 0.38780981858635866

The coherence score is better! Can the score be better if we increase or decrease the number of topics? Let's find out by fine-tuning the model. The tutorial linked in the original article provides an excellent explanation of how to tune the LDA model. Below is the source code from that tutorial:

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the dictionary argument rather than the global id2word
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

limit = 40
start = 2
step = 4
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=tweets, start=start, limit=limit, step=step)

x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values",), loc='best')  # trailing comma makes this a one-element tuple
plt.show()

[Figure: coherence score plotted against the number of topics]

It looks like the coherence score increases with the increase in the number of topics. We will use the model with the highest coherence score:

best_result_index = coherence_values.index(max(coherence_values))
optimal_model = model_list[best_result_index]

model_topics = optimal_model.show_topics(formatted=False)
print(f'''The {x[best_result_index]} topics gives the highest coherence score \
of {coherence_values[best_result_index]}''')

The 34 topics give the highest coherence score of 0.3912.

Awesome! We get a better coherence score. Let's see how the words are clustered using pyLDAvis.

To visualize our model with pyLDAvis, we need to convert the LDA Mallet model into a gensim LdaModel.

def convertldaGenToldaMallet(mallet_model):
    # Despite its name, this converts a Mallet model into a gensim LdaModel
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0,
    )
    # Copy the Mallet word-topic counts into the gensim model's state
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

optimal_model = convertldaGenToldaMallet(optimal_model)

The tuned model is available from a link in the original article. To visualize it with pyLDAvis:

pyLDAvis.enable_notebook()
p = pyLDAvis.gensim.prepare(optimal_model, corpus, id2word)
p

[Figure: pyLDAvis visualization]

An interactive version is linked from the original article. It is easier to distinguish between different topics now.

  • The 1st bubble seems to be about personal relationships
  • The 2nd bubble seems to be about politics
  • The 5th bubble seems to be about positive social events
  • The 6th bubble seems to be about football
  • The 7th bubble seems to be about household
  • The 27th bubble seems to be about sports

And many more. Do you have different guesses for the topics of these bubbles?

Conclusion

Thanks for reading. Hopefully you've learned what topic modeling is about, and how to visualize the results of your model with pyLDAvis.

Although topic modeling is not as accurate as text classification, it is worth it if you don't have enough time and resources to label your data. Why not try an easier solution before coming up with something more sophisticated and more time-consuming?

The code for this article is linked from the original post. I encourage you to apply it to your own data and see what you get.

Source: https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know