Word cloud of Google Search Results using Python
When you search for something on Google, millions of results get thrown at you, of which you are likely to go through only the top few relevant ones. What if you could get a snapshot of what has been written in the top results for a search query in the form of a word-cloud? It would be interesting to see which terms dominate the content about the item you are searching for without perusing each result. Let’s see how to implement this in Python.
Creating Word-cloud from top results of a Google Search Query
Here’s a simple flow-chart of the algorithm to create a word-cloud from the top 10 results of a Google search query. I have used the following modules in Python for the implementation.
google : Perform the search (the package is imported as ‘googlesearch’)
urllib : Open and fetch contents from web pages (HTML version)
bs4 : Extract the relevant content from web page text (from HTML markup to plain text)
wordcloud : Create word-cloud of a text doc
I have implemented some text manipulation code bits to improve the quality of the words in the word-cloud.
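The text manipulation steps are not shown in this post, so the following is only a minimal sketch of what such a cleanup could look like, using the standard library alone. The function name clean_text, the minimum word length, and the regular expression are assumptions for illustration, not the code from the full implementation.

```python
import re

def clean_text(raw_text, min_length=3):
    """Lowercase the text and keep only alphabetic words of a minimum length.

    Hypothetical helper: the exact cleanup rules are an assumption.
    """
    # Find runs of ASCII letters; digits, punctuation and whitespace act as separators.
    words = re.findall(r"[a-z]+", raw_text.lower())
    # Drop very short tokens, which are rarely informative in a word cloud.
    kept = [word for word in words if len(word) >= min_length]
    return " ".join(kept)

sample = "Machine Learning (ML) is a field of AI!\n\n  It learns from data, 24x7."
cleaned = clean_text(sample)
print(cleaned)
```

Because only ASCII letters are kept, this sketch would also split words at apostrophes and drop accented characters, so the pattern may need adjusting for other languages.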
Key Python Functions
The Python functions and classes that play an important role in this implementation are described below.
search(query, tld, lang, num, start, stop, pause) : Function in ‘googlesearch’ module that takes the arguments query (search query), tld (top level domain like ‘com’ or ‘org’), lang (language), num (number of results per page), start (index of the first result to retrieve), stop (index of the last result to retrieve), pause (number of seconds to wait between HTTP requests)
from googlesearch import search
search(query="Machine Learning", tld='com', lang='en', num=10, start=0, stop=10, pause=2.0)
The above line of code will search for “Machine Learning” and return a generator that yields the URLs of 10 results, starting from the very first result (0), with a pause of 2 seconds between requests to Google. The URLs are fetched only as you iterate over the generator.
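Since search() hands back URLs one at a time, it can help to collect them into a list before fetching pages. Below is a small sketch of that step. The helper names top_urls and google_top_urls are hypothetical, and dropping duplicate URLs is an assumption on my part rather than something the full implementation is known to do.

```python
def top_urls(results, limit=10):
    """Return up to `limit` distinct URLs from an iterable, preserving order."""
    seen = set()
    unique = []
    for url in results:
        if len(unique) >= limit:
            break
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

def google_top_urls(query, limit=10):
    """Hypothetical wrapper: run the search shown above and collect the URLs."""
    # Imported lazily so that top_urls() can be used without the 'google' package.
    from googlesearch import search
    results = search(query=query, tld="com", lang="en", num=10,
                     start=0, stop=limit, pause=2.0)
    return top_urls(results, limit)
```

Calling google_top_urls("Machine Learning") would then return a plain list that can be looped over more than once, which a generator does not allow.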
urlopen() : Function in ‘urllib.request’ module that opens a url and returns a response object, whose read() method returns the raw HTML content of the web page as bytes
import urllib.request as url
url_link = "https://expertsystem.com/machine-learning-definition/"
url_content = url.urlopen(url_link).read()
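Some sites reject requests that do not carry a browser-like User-Agent header, and a slow page can stall the whole run. The sketch below wraps urlopen() with a header, a timeout, and basic error handling so that one bad result does not stop the word-cloud from being built. The function name fetch_html, the header value, and the 10-second timeout are assumptions for illustration.

```python
import urllib.request

def fetch_html(url_link, timeout=10):
    """Return the raw bytes of a page, or None if it cannot be fetched."""
    # Illustrative header value; any reasonable User-Agent string can be used.
    request_headers = {"User-Agent": "Mozilla/5.0"}
    try:
        request = urllib.request.Request(url_link, headers=request_headers)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs such as a missing scheme.
        return None
```

Pages for which fetch_html() returns None can simply be skipped when looping over the search results.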
The urlopen().read() call returns the raw HTML of the page, which is hard to read as it is. The next step is to extract readable text from it using the lxml parser.
BeautifulSoup(html_doc, "lxml").text : Class in ‘bs4’ module that parses the HTML doc using the ‘lxml’ parser; its ‘text’ attribute returns the text content of the page with the HTML tags stripped
from bs4 import BeautifulSoup
text_content = BeautifulSoup(url_content, "lxml").text
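Depending on the installed bs4 version, the text pulled from a whole page can include the contents of script and style tags, which would pollute the word-cloud with code fragments. One possible safeguard, sketched below, is to remove those tags before reading the text. The function name visible_text and its parser argument are assumptions for illustration; the default mirrors the ‘lxml’ parser used above.

```python
from bs4 import BeautifulSoup

def visible_text(html_doc, parser="lxml"):
    """Return the page text with script, style and noscript content removed."""
    soup = BeautifulSoup(html_doc, parser)
    # Calling the soup object is shorthand for soup.find_all(...).
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove the tag and everything inside it
    # Join the remaining text fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)
```

Passing parser="html.parser" uses Python's built-in parser instead, which avoids the extra lxml dependency.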
WordCloud(max_font_size, max_words).generate(text) : Class in ‘wordcloud’ module whose generate() method creates a word-cloud from a text; it takes arguments such as max_font_size (font size of the largest word) and max_words (maximum number of words to show)
from wordcloud import WordCloud
WordCloud(max_font_size=100, max_words=100, background_color="white", random_state=0).generate(text_content)
Do check out the full implementation on GitHub.
Here are some interesting results I obtained on some fun search queries.
Apparently, jokes on onion price hike are the most popular in India!
Well, I expected Sheldon to be the biggest word for ‘Big Bang Theory’.
Albert Einstein’s brain is the most talked about!
Improving the results
You can improve the results by a great margin with some additional text manipulation steps. For example, set ‘film’ and ‘movie’ as stop-words if you are searching for a movie, since those words would otherwise dominate the cloud without telling you anything new. Try changing the number of results you include to create the word cloud to see how different the results turn out to be.
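The stop-word idea can be sketched as follows. The wordcloud package ships a default STOPWORDS set, and WordCloud accepts a stopwords argument. The helper names merged_stopwords and make_cloud, and the choice to turn off collocations so that only single words are drawn, are assumptions for illustration.

```python
def merged_stopwords(base, extra):
    """Combine two collections of stop-words into one lowercase set."""
    return {word.lower() for word in base} | {word.lower() for word in extra}

def make_cloud(text, extra_stopwords=("film", "movie")):
    """Hypothetical helper: build a word cloud that ignores query-specific words."""
    # Imported lazily so that merged_stopwords() works without the package.
    from wordcloud import WordCloud, STOPWORDS
    cloud = WordCloud(
        max_font_size=100,
        max_words=100,
        background_color="white",
        random_state=0,
        stopwords=merged_stopwords(STOPWORDS, extra_stopwords),
        collocations=False,  # draw single words only, not two-word phrases
    )
    return cloud.generate(text)
```

For example, make_cloud(text_content).to_file("cloud.png") would write the resulting image to disk, where cloud.png is just a placeholder file name.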
Let me know in the comments if you get any particularly interesting Word Clouds with this implementation!