Python word cloud from an HTML page
From wikiluntti
Latest revision as of 00:20, 19 August 2021
Introduction
Analyze HTML tables using word clouds.
Theory
Fetching the table
Scraping the table is easiest with pandas. BeautifulSoup is another good option.
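As a minimal sketch of the pandas route, assuming the page contains at least one table and that a parser backend such as lxml is installed: `pandas.read_html` returns a list of DataFrames, one per table found. The helper names `fetch_tables` and `table_to_words` and the sample rows are illustrative and are not taken from the original script.

```python
from io import StringIO


def fetch_tables(html):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    # Imported lazily; pandas needs lxml (or bs4 + html5lib) as parser backend.
    import pandas as pd

    return pd.read_html(StringIO(html))


def table_to_words(rows):
    """Flatten table rows (an iterable of rows of cells) into lowercase words."""
    words = []
    for row in rows:
        for cell in row:
            words.extend(str(cell).lower().split())
    return words


# Hypothetical stand-in for `fetch_tables(html)[0].values.tolist()`.
sample_rows = [["Kissa istuu", "puussa"], ["Koira juoksee", "pihalla"]]
print(table_to_words(sample_rows))
```

With a real page, the rows would come from one of the DataFrames returned by `fetch_tables`, for example `fetch_tables(html)[0].values.tolist()`.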
Linguistic analysis
The table text is in Finnish, so the Voikko morphological analyzer is used to lemmatize each word into its base form.
<pre>
sudo apt -y install voikko-fi python-libvoikko
pip3 install libvoikko
</pre>
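The lemmatization step could look roughly like the sketch below. It assumes the libvoikko Python binding, where `Voikko("fi").analyze(word)` returns a list of analyses, each a dict with a `"BASEFORM"` entry. The function names and the stand-in analyzer data are hypothetical; the stand-in is there only so the sketch runs without Voikko installed.

```python
def lemmatize(words, analyze):
    """Replace each word by its base form; keep the word unchanged if unknown.

    `analyze` is any callable that returns a list of analyses (dicts with a
    "BASEFORM" key) for a word, such as `Voikko("fi").analyze`.
    """
    lemmas = []
    for word in words:
        analyses = analyze(word)
        # Take the first analysis; an ambiguous word may have several.
        lemmas.append(analyses[0]["BASEFORM"] if analyses else word)
    return lemmas


def voikko_analyzer():
    """Return the analyze method of a Finnish Voikko instance."""
    # Requires libvoikko and the voikko-fi dictionary to be installed.
    from libvoikko import Voikko

    return Voikko("fi").analyze


# Stand-in analyzer with hypothetical data, used instead of Voikko here.
fake = {"kissoja": [{"BASEFORM": "kissa"}], "taloissa": [{"BASEFORM": "talo"}]}
print(lemmatize(["kissoja", "taloissa", "xyz"], lambda w: fake.get(w, [])))
```

With Voikko installed, the call would be `lemmatize(words, voikko_analyzer())`. Taking only the first analysis is a simplification, since Voikko may return several candidates for an ambiguous word.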
References
https://data.solita.fi/finnish-stemming-and-lemmatization-in-python/
The word cloud
The WordCloud library is easy to use. It can also produce SVG graphics, although the fonts may get mixed up.
<pre>
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# bwords is assumed to hold the lemmatized words from the previous step
stopwords = set(STOPWORDS)

# Build an 800x800 cloud from single words (no two-word collocations)
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      collocations=False,
                      min_font_size=10).generate(' '.join(map(str, bwords)))
</pre>
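To check which words the cloud will emphasise before drawing it, the frequencies can be counted directly. This is a minimal sketch, assuming `bwords` is a list of lemmatized words. The `word_frequencies` helper and the sample Finnish stop words are illustrative; the bundled `STOPWORDS` set is English, so Finnish stop words would have to be supplied separately.

```python
from collections import Counter


def word_frequencies(words, stopwords=frozenset(), min_length=2):
    """Count words, skipping stop words and tokens shorter than min_length."""
    counts = Counter(
        w for w in words if w not in stopwords and len(w) >= min_length
    )
    return dict(counts)


# Hypothetical lemmatized words and Finnish stop words, for illustration only.
bwords = ["kissa", "ja", "koira", "kissa", "on", "talo", "kissa", "koira"]
finnish_stopwords = {"ja", "on"}
freqs = word_frequencies(bwords, finnish_stopwords)
print(freqs)
```

A dict like `freqs` can be passed to `WordCloud.generate_from_frequencies` as an alternative to calling `generate` on the joined text.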
Save in SVG format
<pre>
# title and AS are not defined in this snippet; they are assumed to come
# from earlier in the full script
name = title + "_" + AS.replace(" ", "")

# Write the SVG with the font embedded
wordcloud_svg = wordcloud.to_svg(embed_font=True)
with open(name + ".svg", "w", encoding="utf-8") as f:
    f.write(wordcloud_svg)

# Plot the WordCloud image and save it as PNG
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
#plt.rcParams['svg.fonttype'] = 'non
plt.rcParams["savefig.format"] = 'png'
print(name)
plt.savefig(name)
plt.show()
</pre>
The code
See the code at GitHub: https://gist.github.com/markkuleino/5173152fdd8fa8877c433d300eb71d45