Word cloud of Google Search Results using Python

Generate a word cloud from the top results for a Google search query

Photo by Thimo Pedersen on Unsplash

When you search for something on Google, millions of results are thrown at you, of which you are likely to go through only the top few relevant ones. What if you could get a snapshot of what has been written in the top results for a search query in the form of a word cloud? It would be interesting to see what is most talked about for the item you are searching for, without perusing each result. Let’s see how to implement this in Python.

Here’s a simple flow-chart of the algorithm to create a word-cloud from the top 10 results of a Google search query. I have used the following modules in Python for the implementation.

google : Perform the search (the package is imported as ‘googlesearch’)
urllib : Open and fetch contents from web pages (HTML version)
bs4 : Extract the relevant content from a web page (parse the HTML and pull out its plain text)
wordcloud : Create word-cloud of a text doc

Flowchart of the full implementation

I have also added a few text-manipulation steps to improve the quality of the words in the word cloud.
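As a rough illustration of what such text manipulation might look like, here is a minimal sketch. The function name clean_text and the min_length threshold are assumptions made for this example and are not taken from the original implementation; it simply lowercases the text, keeps alphabetic tokens and drops very short ones.

```python
import re


def clean_text(raw_text, min_length=3):
    """Lowercase the text, keep ASCII alphabetic tokens, drop very short ones.

    Hypothetical helper: the name and the min_length default are
    illustrative assumptions, not part of the original implementation.
    """
    # Digits and punctuation act as separators and are discarded.
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return " ".join(t for t in tokens if len(t) >= min_length)


print(clean_text("Machine Learning (ML) is 1 of the top-10 fields in AI!"))
# prints: machine learning the top fields
```

Note that this pattern only keeps ASCII letters, so it would need adjusting for non-English pages.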

The Python functions that play an important role in this implementation are described below.

search(query, tld, lang, num, start, stop, pause) : Function in the ‘googlesearch’ module that takes the arguments query (search query), tld (top-level domain, like ‘com’ or ‘org’), lang (language), num (number of results per page), start (index of the first result to retrieve), stop (index of the last result to retrieve) and pause (number of seconds to wait between requests)

from googlesearch import search
# search() returns a generator, so wrap it in list() to collect the URLs
urls = list(search(query="Machine Learning", tld='com', lang='en', num=10, start=0, stop=10, pause=2.0))
Top 10 urls for the search query — ‘Machine Learning’

The above line of code will search for “Machine Learning” and will fetch 10 results starting from the very first result (0) with a pause time of 2 seconds between url requests.

urlopen() : Function in the ‘urllib.request’ module that opens a URL and returns a response object, whose read() method returns the raw HTML content of the web page

import urllib.request as url
url_link = "https://expertsystem.com/machine-learning-definition/"
url_content = url.urlopen(url_link).read()
HTML version of the url content

As you can see, urlopen().read() returns the raw HTML of the page, which is not very readable. The next step is to convert this to readable text using the lxml parser.

BeautifulSoup(html_doc, "lxml").text : Class in the ‘bs4’ module that parses the HTML doc with the ‘lxml’ parser; its .text attribute returns the text content with the HTML tags stripped

from bs4 import BeautifulSoup
text_content = BeautifulSoup(url_content, "lxml").text
Text content extracted by ‘lxml’ parser
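The fetch and extract steps can be combined into a loop over all the result URLs. The following is only a sketch under a few assumptions: the helper names (fetch_html, html_to_text, collect_text), the User-Agent header, the timeout value and the choice to skip pages that fail to load are all illustrative and not taken from the original implementation.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch the raw HTML of a page (hypothetical helper)."""
    # Assumption: some sites reject urllib's default User-Agent,
    # so a browser-like one is sent instead.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def html_to_text(html):
    """Strip tags, scripts and styles from an HTML document (hypothetical helper)."""
    from bs4 import BeautifulSoup  # imported here so the rest runs without bs4

    soup = BeautifulSoup(html, "lxml")
    # Script and style bodies are not visible text, so remove them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ")


def collect_text(urls, fetch=fetch_html, extract=html_to_text):
    """Join the text of every page that can be fetched and parsed."""
    parts = []
    for url in urls:
        try:
            parts.append(extract(fetch(url)))
        except Exception as exc:  # a single bad page should not stop the run
            print(f"Skipping {url}: {exc}")
    return " ".join(parts)


# Example (needs network access, bs4 and lxml):
# text_content = collect_text(["https://expertsystem.com/machine-learning-definition/"])
```

Passing fetch and extract as parameters is a design choice for this sketch: it makes it easy to swap in a different parser or to try the loop without network access.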

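WordCloud(max_font_size, max_words).generate(text) : Class in the ‘wordcloud’ module whose generate() method creates a word cloud from a text; it takes arguments such as max_font_size (font size of the largest word) and max_words (maximum number of words to display)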

from wordcloud import WordCloud
cloud = WordCloud(max_font_size=100, max_words=100, background_color="white", random_state=0).generate(text_content)
Word Cloud for ‘Machine Learning’ from top search results

Do check out the full implementation on GitHub.

Here are some interesting results I obtained on some fun search queries.

Word Cloud for Kung Fu Panda

Apparently, jokes about the onion price hike are the most popular in India!

Well, I expected Sheldon to be the biggest word for ‘Big Bang Theory’.

Albert Einstein’s brain is the most talked about!

You can improve the results by a great margin with some additional text manipulation steps. For example, set ‘film’ and ‘movie’ as stop-words if you are searching for a movie. Try changing the number of results you include in the word cloud to see how different the results turn out to be.
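To see the effect of query-specific stop-words, here is a small sketch that counts word frequencies with the standard library only. The function name word_frequencies, the tiny BASIC_STOPWORDS set and the sample sentence are assumptions made up for this illustration.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
BASIC_STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}


def word_frequencies(text, extra_stopwords=()):
    """Count words in text, ignoring basic and query-specific stop-words."""
    stopwords = BASIC_STOPWORDS | {w.lower() for w in extra_stopwords}
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


sample = "The movie is a film about a panda. The panda learns kung fu in the film."
freqs = word_frequencies(sample, extra_stopwords={"film", "movie"})
print(freqs["panda"], "film" in freqs)
# prints: 2 False
```

With the wordcloud package itself, the same idea can be expressed by passing a stopwords set to WordCloud, for example the package's built-in STOPWORDS set extended with {"film", "movie"}, or by feeding counts like these to generate_from_frequencies().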

Let me know in the comments if you get any particularly interesting Word Clouds with this implementation!