Python word cloud from an HTML page

From wikiluntti
 
Latest revision as of 00:20, 19 August 2021

Introduction

Analyze the text of HTML tables by drawing word clouds from it.

Theory

Fetching the table

Scraping the table is easiest with pandas. BeautifulSoup is another good option.
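
As a minimal sketch of the pandas route: pandas.read_html parses every table element it finds and returns a list of DataFrames. The inline HTML, its column names and the helper fetch_tables below are made up for illustration; for a live page, pass the page URL instead. read_html needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table standing in for the real page; replace with a URL.
HTML = """
<table>
  <tr><th>Title</th><th>Description</th></tr>
  <tr><td>Talo</td><td>Punainen talo</td></tr>
  <tr><td>Auto</td><td>Nopea auto</td></tr>
</table>
"""


def fetch_tables(source):
    """Return all tables found in `source` as a list of DataFrames."""
    return pd.read_html(source)


tables = fetch_tables(StringIO(HTML))
table = tables[0]

# Collect the words of one text column into a flat list.
words = " ".join(table["Description"].astype(str)).split()
print(words)
```

The resulting list of words is what the later steps (lemmatization and the word cloud) work on.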

Linguistic analysis

The source text is in Finnish, so the Voikko morphological analyzer is used to lemmatize each word into its base form.

sudo apt install -y voikko-fi python-libvoikko
pip3 install libvoikko
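
Once Voikko is installed, lemmatization could look roughly like the sketch below. The helper names make_voikko_analyzer and lemmatize are hypothetical, and so is the choice to take the first analysis Voikko returns and to keep unrecognised words unchanged. The variable name bwords matches the one used in the word cloud code on this page; that it holds the lemmatized words is an assumption.

```python
def make_voikko_analyzer(language="fi"):
    """Return a function mapping a word to a list of Voikko analyses.

    libvoikko is imported lazily so the rest of the sketch can be
    tried out without the Finnish dictionary installed.
    """
    from libvoikko import Voikko
    return Voikko(language).analyze


def lemmatize(words, analyze):
    """Replace each word by the base form of its first analysis.

    Words the analyzer does not recognise are kept unchanged.
    """
    lemmas = []
    for word in words:
        analyses = analyze(word)
        if analyses:
            lemmas.append(analyses[0].get("BASEFORM", word))
        else:
            lemmas.append(word)
    return lemmas


# Usage with the real analyzer (requires voikko-fi and libvoikko):
#   bwords = lemmatize(words, make_voikko_analyzer())
```

Passing the analyzer in as an argument keeps lemmatize easy to test with a stand-in function.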

References

https://data.solita.fi/finnish-stemming-and-lemmatization-in-python/

See also Tarmo perusmuodoistaja ("perusmuodoistaja" is Finnish for "lemmatizer").

The word cloud

The WordCloud library is easy to use. It can also export SVG graphics, although the fonts may not always render as intended.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# STOPWORDS is the library's built-in (English) stop word list.
stopwords = set(STOPWORDS)

# bwords is the list of words produced by the earlier steps.
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      collocations=False,
                      min_font_size=10).generate(' '.join(map(str, bwords)))
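
An alternative to joining the words into one string is to count them first and pass the counts to generate_from_frequencies. The sketch below is only one way to do it: the helper names are hypothetical, and the stop word {"ja"} is a made-up stand-in for a proper Finnish stop word list, since the built-in STOPWORDS set is English. generate_from_frequencies takes the counts as given, so stop words are filtered here by hand.

```python
from collections import Counter


def word_frequencies(words, stopwords=()):
    """Count the words, ignoring case and skipping stop words."""
    skip = {s.lower() for s in stopwords}
    return Counter(
        w.lower() for w in map(str, words) if w.lower() not in skip
    )


def cloud_from_frequencies(frequencies):
    """Build a word cloud from a word -> count mapping."""
    from wordcloud import WordCloud  # imported lazily
    cloud = WordCloud(width=800, height=800,
                      background_color="white", min_font_size=10)
    return cloud.generate_from_frequencies(frequencies)


freqs = word_frequencies(["talo", "Talo", "ja", "auto"], stopwords={"ja"})
print(freqs.most_common(2))  # [('talo', 2), ('auto', 1)]
```

Calling cloud_from_frequencies(freqs) then gives an object that can be saved or plotted in the same way as the wordcloud variable above.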

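Save in SVG format

# title and AS are defined earlier in the full script.
name = title + "_" + AS.replace(" ", "")

wordcloud_svg = wordcloud.to_svg(embed_font=True)
# The context manager closes the file even if writing fails.
with open(name + ".svg", "w", encoding="utf-8") as f:
    f.write(wordcloud_svg)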

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

#plt.rcParams['svg.fonttype'] = 'none'

plt.rcParams["savefig.format"] = 'png'

print( name )
plt.savefig( name )

plt.show()

The code

See the code at GitHub: https://gist.github.com/markkuleino/5173152fdd8fa8877c433d300eb71d45

Exercises