<h1>ArabicNLP Datasets</h1>

<table>
<thead>
<tr>
      <th>Category</th>
<th>URL</th>
</tr>
</thead>
<tbody>
<tr>
<td>Papers</td>
<td>https://arxiv.org/ftp/arxiv/papers/1702/1702.07835.pdf</td>
</tr>
<tr>
<td>Sentiment Analysis</td>
<td>http://archive.ics.uci.edu/ml/datasets/Twitter+Data+set+for+Arabic+Sentiment+Analysis</td>
</tr>
<tr>
<td>Classification</td>
<td>http://archive.ics.uci.edu/ml/datasets/Opinion+Corpus+for+Lebanese+Arabic+Reviews+%28OCLAR%29</td>
</tr>
<tr>
<td>Wikipedia</td>
<td>https://archive.org/details/arwiki-20190201</td>
</tr>
<tr>
<td>Multiple</td>
<td>https://lionbridge.ai/datasets/best-arabic-datasets-for-machine-learning/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/ibnmalik/golden-corpus-arabic/tree/develop/core</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/linuxscout/tashkeela2</td>
</tr>
<tr>
<td>Diacritization</td>
<td>https://github.com/AliOsm/arabic-text-diacritization/tree/master/dataset</td>
</tr>
<tr>
<td> </td>
<td>http://tanzil.net/download</td>
</tr>
<tr>
<td> </td>
<td>https://www.kaggle.com/datasets/linuxscout/tashkeela?resource=download</td>
</tr>
<tr>
<td>Crawls</td>
<td>https://traces1.inria.fr/oscar/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/alisafaya/Arabic-BERT</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/mohataher/arabic_big_corpus</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/aosaimy/riyadh-corpus-collection</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/anastaw/Arabic-Wikipedia-Corpus/blob/master/Wikipedia-Corpus-30-08-10.sql.gz</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/antcorpus/RSSCrawlerArabicCorpus</td>
</tr>
<tr>
<td> </td>
<td>BBC Crawl: https://github.com/motazsaad/bbc-crawler</td>
</tr>
<tr>
<td>News</td>
<td>https://github.com/parallelfold/SaudiNewsNet</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/motazsaad/Arabic-News</td>
</tr>
<tr>
<td>ArabicWeb16</td>
<td>Labeled Dataset: https://sites.google.com/view/arabicweb16/download/labelled-datasets?authuser=0</td>
</tr>
<tr>
<td> </td>
<td>Sample big: https://sites.google.com/view/arabicweb16/getting-started?authuser=0 https://drive.google.com/drive/folders/0B6P2zR7VKiV4SWdITFlXcmxObWM</td>
</tr>
<tr>
<td>Raw</td>
<td>https://github.com/Islamicate-DH/arabicCorpus</td>
</tr>
<tr>
<td>NER</td>
<td>https://github.com/EmnamoR/Arabic-named-entity-recognition</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/RamziSalah/Classical-Arabic-Named-Entity-Recognition-Corpus</td>
</tr>
<tr>
<td> </td>
<td>https://www.cs.cmu.edu/~ark/ArabicNER/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/juand-r/entity-recognition-datasets</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/oudalab/Arabic-NER</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/EmnamoR/Arabic-named-entity-recognition/tree/master/</td>
</tr>
<tr>
<td>Keyphrase</td>
<td>https://github.com/ailab-uniud/akec</td>
</tr>
<tr>
<td>Sentiment Analysis</td>
<td>https://github.com/nora-twairesh/AraSenti</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/almoslmi/masc</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/marwanalomari/Sentiment-Classifier-Logistic-Regression-for-Arabic-Services-Reviews-in-Lebanon</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/komari6/Arabic-twitter-corpus-AJGT</td>
</tr>
<tr>
<td> </td>
<td>https://tahatobaili.github.io/project-rbz/</td>
</tr>
<tr>
<td>Speech data</td>
<td>http://www.cs.stir.ac.uk/~lss/arabic/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/Anwarvic/Arabic-Speech-Recognition</td>
</tr>
<tr>
<td>Tashkeel</td>
<td>https://github.com/Anwarvic/Tashkeela-Model</td>
</tr>
<tr>
<td>Speech to text</td>
<td>https://github.com/motazsaad/jsc-news-broadcast</td>
</tr>
<tr>
<td>Misspellings</td>
<td>https://github.com/linuxscout/aghlat</td>
</tr>
<tr>
<td>Arabic/English Translation</td>
<td>https://github.com/meedan/news-memory</td>
</tr>
<tr>
<td>Poetry</td>
<td>https://github.com/d7eame/Matn</td>
</tr>
<tr>
<td>WordEmbeddings</td>
<td>http://mazajak.inf.ed.ac.uk:8000/</td>
</tr>
<tr>
<td>Stories</td>
<td>https://github.com/motazsaad/Arabic-Stories-Corpus</td>
</tr>
<tr>
<td>Dialects</td>
<td>https://github.com/motazsaad/corpus2json/tree/master/corpora/nizar_arabic_dialects</td>
</tr>
<tr>
<td>Opinion Mining</td>
<td>https://github.com/AhmedObaidi/omcca</td>
</tr>
<tr>
<td>POS + rel</td>
<td>https://github.com/salsama/Arabic-Information-Extraction-Corpus</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/qcri/dialectal_arabic_pos_tagger</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/seloufian/Arabic-PoS-Tagger</td>
</tr>
<tr>
<td>Annotated per nationality</td>
<td>https://github.com/Data-Science-for-Linguists-2020/Arabic-Learner-Corpus-Considerations</td>
</tr>
<tr>
<td>OCR</td>
<td>Digits: https://www.kaggle.com/mloey1/ahdd1</td>
</tr>
<tr>
<td> </td>
<td>Letters: https://www.kaggle.com/mloey1/ahcd1</td>
</tr>
<tr>
<td> </td>
<td>https://cactus.orange-labs.fr/ALIF/download.html</td>
</tr>
<tr>
<td> </td>
<td>http://www.ccse.kfupm.edu.sa/~husni/ArabicOCR/PATS-A02.htm</td>
</tr>
<tr>
<td> </td>
<td>http://kafd.ideas2serve.net/KAFDDownloadOptions.php</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/ainawind27/arabicocr-data</td>
</tr>
<tr>
<td> </td>
<td>http://kitab-project.org/</td>
</tr>
<tr>
<td> </td>
<td>https://medium.com/@openiti/openiti-aocp-9802865a6586</td>
</tr>
<tr>
<td> </td>
<td>https://www.rdi-sotoor.com/#/login</td>
</tr>
<tr>
<td> </td>
<td>https://www.primaresearch.org/RASM2019/</td>
</tr>
<tr>
<td> </td>
<td>https://blogs.bl.uk/digital-scholarship/2018/02/8th-century-arabic-scientists-meet-todays-computer-scientists.html</td>
</tr>
<tr>
<td> </td>
<td>https://blogs.bl.uk/digital-scholarship/2018/03/arabic-handwrittten-ocr.html</td>
</tr>
<tr>
<td> </td>
<td>Raw images: https://fromthepage.com/bldigital/arabic-scientific-manuscripts</td>
</tr>
<tr>
<td>Arabic conversation for chatbots</td>
<td>https://www.kaggle.com/ahmedkaramdev/arabic-conversational-dataset</td>
</tr>
<tr>
<td>WordNet</td>
<td>http://compling.hss.ntu.edu.sg/omw/</td>
</tr>
<tr>
<td> </td>
<td>Resources: http://globalwordnet.org/resources/arabic-wordnet/arabic-resources/</td>
</tr>
<tr>
<td> </td>
<td>http://compling.hss.ntu.edu.sg/omw/wns/arb/LICENSE</td>
</tr>
<tr>
<td>Other</td>
<td>https://www.al-fanarmedia.org/2018/11/an-online-arabic-dictionary-makes-its-debut/#.W__YlAjqMNw.twitter</td>
</tr>
<tr>
<td> </td>
<td>Treebank: https://sourceforge.net/projects/arabicsubcats/files/</td>
</tr>
</tbody>
</table>
<h1>Particle Swarm Optimization: Python Tutorial</h1>

<p><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PSO) is one of the most successful and famous population-based <a href="https://en.wikipedia.org/wiki/Metaheuristic">metaheuristics</a>. Its simplicity and performance have made it easy to adapt to many applications, including scheduling (more details in my paper <a href="http://ieeexplore.ieee.org/abstract/document/7280067/">Cloudlet Scheduling with Particle Swarm Optimization</a>) and power consumption optimization in IoT devices (more details in my paper <a href="https://arxiv.org/pdf/1602.02473.pdf">Particle Swarm Optimized Power Consumption of Trilateration</a>).</p>
<p>There are many versions of PSO, such as hybrid ones where PSO is used along with other algorithms (such as <a href="https://en.wikipedia.org/wiki/Simulated_annealing">Simulated Annealing</a>, as in my publication above). In general, though, a pure PSO algorithm is either single- or multi-objective and operates on either a discrete or a continuous space. The objectives of the algorithm are the things PSO tries to find a solution for. For example, PSO might concentrate on reducing the power consumption of a device without taking anything else into consideration, such as the speed of the device. That is why multi-objective versions were developed: to balance the solution. Balancing the work of the algorithm to consider more than one objective is part of the <a href="https://en.wikipedia.org/wiki/Game_theory">Game theory</a> field (if you are curious and want to know more, look at <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibrium</a> and <a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto Optimality</a>).</p>
<p>The original algorithm works on a continuous space, meaning it solves problems such as the numeric minimization problem I mentioned in the <a href="http://hussein.space/2016/heuristics/">previous Heuristics post</a>. PSO also works on a discrete binary space, where the algorithm finds $ 0 $ or $ 1 $ values for the given problem (a simple example can be found in <a href="http://ieeexplore.ieee.org/abstract/document/7280067/">my scheduling paper</a>). But now, let’s go back to our simple minimization problem and try to solve it using PSO. <em>At this point, if you haven’t read my other <a href="http://hussein.space/2016/heuristics/">blog post</a>, please do so to know what I am trying to solve here.</em></p>
<p>PSO starts by creating a swarm of particles where each particle is a possible solution to the problem. Therefore, we need to understand what exactly we are trying to solve and how to map it to the objective function of PSO, which is considered the hardest part when designing the algorithm.</p>
<p>Let’s first define a few global variables needed throughout our program:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># the x and y in our function (x - y + 7) (aka. dimensions)
</span><span class="n">number_of_variables</span> <span class="o">=</span> <span class="mi">2</span>
<span class="c1"># the minimum possible value x or y can take
</span><span class="n">min_value</span> <span class="o">=</span> <span class="o">-</span><span class="mi">100</span>
<span class="c1"># the maximum possible value x or y can take
</span><span class="n">max_value</span> <span class="o">=</span> <span class="mi">100</span>
<span class="c1"># the number of particles in the swarm
</span><span class="n">number_of_particles</span> <span class="o">=</span> <span class="mi">10</span>
<span class="c1"># number of times the algorithm moves each particle in the problem space
</span><span class="n">number_of_iterations</span> <span class="o">=</span> <span class="mi">2000</span>
<span class="n">w</span> <span class="o">=</span> <span class="mf">0.729</span> <span class="c1"># inertia weight
</span><span class="n">c1</span> <span class="o">=</span> <span class="mf">1.49</span> <span class="c1"># cognitive (particle)
</span><span class="n">c2</span> <span class="o">=</span> <span class="mf">1.49</span> <span class="c1"># social (swarm)
</span></code></pre></div></div>
<p>The first step is to create the swarm of particles</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">swarm</span> <span class="o">=</span> <span class="p">[</span><span class="n">Particle</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">,</span> <span class="n">min_value</span><span class="p">,</span> <span class="n">max_value</span><span class="p">)</span>
<span class="k">for</span> <span class="n">__x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_particles</span><span class="p">)]</span>
</code></pre></div></div>
<p>Each <code class="language-plaintext highlighter-rouge">Particle</code> is an <a href="https://en.wikipedia.org/wiki/Abstract_data_type">Abstract Data Type (ADT)</a> defined as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Particle</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">number_of_variables</span><span class="p">,</span> <span class="n">min_value</span><span class="p">,</span> <span class="n">max_value</span><span class="p">):</span>
<span class="c1"># init x and y values
</span> <span class="bp">self</span><span class="p">.</span><span class="n">positions</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">)]</span>
<span class="c1"># init velocities of x and y
</span> <span class="bp">self</span><span class="p">.</span><span class="n">velocities</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">):</span>
<span class="c1"># update x and y positions
</span> <span class="bp">self</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">max_value</span> <span class="o">-</span> <span class="n">min_value</span><span class="p">)</span>
<span class="o">*</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">+</span> <span class="n">min_value</span><span class="p">)</span>
<span class="c1"># update x and y velocities
</span> <span class="bp">self</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">max_value</span> <span class="o">-</span> <span class="n">min_value</span><span class="p">)</span>
<span class="o">*</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">+</span> <span class="n">min_value</span><span class="p">)</span>
<span class="c1"># current fitness after updating the x and y values
</span> <span class="bp">self</span><span class="p">.</span><span class="n">fitness</span> <span class="o">=</span> <span class="n">Fitness</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
<span class="c1"># the current particle positions as the best fitness found yet
</span> <span class="bp">self</span><span class="p">.</span><span class="n">best_particle_positions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
<span class="c1"># the current particle fitness as the best fitness found yet
</span> <span class="bp">self</span><span class="p">.</span><span class="n">best_particle_fitness</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fitness</span>
</code></pre></div></div>
<p>Before explaining what each line means, let me first describe how a particle behaves in the problem space, and then get back to our code snippet above.</p>
<p>As I mentioned before, each particle in the swarm represents a possible solution to the problem. And as I also mentioned in the <a href="http://hussein.space/2016/heuristics/">previous blog post</a>, each particle tries to improve its solution by learning from two sources: its own movement in the problem space and the movement of the other particles of the swarm (by learning from the best solution found by any of the other particles). To watch a simple visualization of PSO, click on the image below:</p>
<p><a href="http://www.youtube.com/embed/_bzRHqmpwvo" target="_blank"><img style="float: center; width: 80%;" src="https://img.youtube.com/vi/_bzRHqmpwvo/hqdefault.jpg" /></a></p>
<p>Now, let’s see how that translates into code. The positions list in the above code snippet represents the current values of the variables in the objective function ($ x $ and $ y $), while the velocities list represents the (artificial) velocity of the particle along each dimension. We first initialize the values to zeros and then update them using random numbers as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span> <span class="n">max_value</span> <span class="o">-</span> <span class="n">min_value</span><span class="p">)</span> <span class="o">*</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">+</span> <span class="n">min_value</span> <span class="p">)</span>
</code></pre></div></div>
<p>Each particle in our swarm keeps track of its current fitness value as well as the best positions and fitness it has found so far. A particle’s fitness is the value obtained by plugging the current positions list into the objective function (in our example problem, $ positions[0] = x $ and $ positions[1] = y $). Notice that during initialization we take the particle’s fitness and positions as the best found yet, because that might well be the case; later, during each iteration of the algorithm, we will check and update that information.</p>
<p>After initializing the swarm, we check all particles, find the best solution found so far, and keep track of it using the two variables <em>best_swarm_positions</em> and <em>best_swarm_fitness</em>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">best_swarm_positions</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">)]</span>
<span class="n">best_swarm_fitness</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">float_info</span><span class="p">.</span><span class="nb">max</span>
<span class="k">for</span> <span class="n">particle</span> <span class="ow">in</span> <span class="n">swarm</span><span class="p">:</span> <span class="c1"># check each particle
</span> <span class="k">if</span> <span class="n">particle</span><span class="p">.</span><span class="n">fitness</span> <span class="o"><</span> <span class="n">best_swarm_fitness</span><span class="p">:</span>
<span class="n">best_swarm_fitness</span> <span class="o">=</span> <span class="n">particle</span><span class="p">.</span><span class="n">fitness</span>
<span class="n">best_swarm_positions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
</code></pre></div></div>
<p>Now we are ready to start moving particles in the problem space: we generate new velocities to find new positions ($x$ and $y$ values) and, eventually, new solutions (fitness values). The fitness value of each particle in the swarm is updated during each iteration of the algorithm.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">__x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_iterations</span><span class="p">):</span>
<span class="k">for</span> <span class="n">particle</span> <span class="ow">in</span> <span class="n">swarm</span><span class="p">:</span>
<span class="c1"># start moving/updating particles to calculate new fitness
</span></code></pre></div></div>
<p>Then, inside the nested loops, we start updating the velocities and positions and calculate the new fitness while keeping track of the best fitness of the swarm. But first, let’s start by updating the velocities as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute new velocities for each particle
</span><span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">):</span>
<span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">calculate_new_velocity_value</span><span class="p">(</span><span class="n">particle</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
</code></pre></div></div>
<p>For each variable in the objective function, we calculate a new velocity that will later help calculate a new set of positions. That is done by calling the <strong>calculate_new_velocity_value()</strong> function, passing it the current particle and the variable index:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate a new velocity for one variable
</span><span class="k">def</span> <span class="nf">calculate_new_velocity_value</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">v</span><span class="p">):</span>
<span class="c1"># generate random numbers
</span> <span class="n">r1</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
<span class="n">r2</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
<span class="c1"># the learning rate part
</span> <span class="n">part_1</span> <span class="o">=</span> <span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">p</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">])</span>
<span class="c1"># the cognitive part - learning from itself
</span> <span class="n">part_2</span> <span class="o">=</span> <span class="p">(</span><span class="n">c1</span> <span class="o">*</span> <span class="n">r1</span> <span class="o">*</span>
<span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">best_particle_positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">-</span> <span class="n">p</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]))</span>
<span class="c1"># the social part - learning from others
</span> <span class="n">part_3</span> <span class="o">=</span> <span class="p">(</span><span class="n">c2</span> <span class="o">*</span> <span class="n">r2</span> <span class="o">*</span>
<span class="p">(</span><span class="n">best_swarm_positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">-</span> <span class="n">p</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]))</span>
<span class="n">new_velocity</span> <span class="o">=</span> <span class="n">part_1</span> <span class="o">+</span> <span class="n">part_2</span> <span class="o">+</span> <span class="n">part_3</span>
<span class="k">return</span> <span class="n">new_velocity</span>
</code></pre></div></div>
<p>As shown in the above code snippet, the new velocity is the sum of the following three parts:</p>
<h2 id="the-learning-rate-part">The learning rate part:</h2>
<ul>
  <li>The product of the inertia weight parameter ($w$) and the particle’s current velocity.</li>
  <li>The inertia weight influences the convergence of the algorithm and the exploration of its particles; a well-chosen inertia weight is therefore very important to the quality of the solution found by PSO. A higher inertia weight means bigger steps in the problem space (in other words, higher velocities). There are many types of inertia weights, but in this example we use a fixed (static) inertia weight, which does not change throughout the iterations. To learn about other kinds of inertia weights, read section 2.3.4 in my <a href="https://etd.ohiolink.edu/ap/10?0::NO:10:P10_ACCESSION_NUM:toledo1403922600">master’s thesis</a>. Also, check how simulated annealing helped PSO achieve better results in my paper (<a href="http://ieeexplore.ieee.org/abstract/document/7280067/">Cloudlet Scheduling with Particle Swarm Optimization</a>).</li>
</ul>
<h2 id="the-cognitive-part">The cognitive part:</h2>
<ul>
  <li>The product of the constant <em>c1</em>, the random number <em>r1</em>, and the difference between the position value corresponding to the best fitness found by the particle and its current position value.</li>
<li>The overall idea of this part of the equation is to represent the cognitive (self-learning) part of the particle.</li>
</ul>
<h2 id="the-social-part">The social part:</h2>
<ul>
  <li>The product of the constant <em>c2</em>, the random number <em>r2</em>, and the difference between the position value corresponding to the best fitness found by any particle of the swarm and the particle’s current position value.</li>
<li>The overall idea of this part of the equation is to represent the social ability of the particle (learning from the swarm).</li>
</ul>
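<p>Putting the three parts together, the full velocity-update rule implemented by <strong>calculate_new_velocity_value()</strong> is, for each variable $v$:</p>

$$ velocity_v = \underbrace{w \cdot velocity_v}_{\text{inertia}} + \underbrace{c_1 r_1 \left(pbest_v - position_v\right)}_{\text{cognitive}} + \underbrace{c_2 r_2 \left(gbest_v - position_v\right)}_{\text{social}} $$

<p>where $pbest$ holds the particle’s own best positions (<em>best_particle_positions</em>) and $gbest$ the best positions found by the whole swarm (<em>best_swarm_positions</em>).</p>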
<p>To keep our velocities within our desired range we use the following few lines of code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o"><</span> <span class="n">min_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">min_value</span>
<span class="k">elif</span> <span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">></span> <span class="n">max_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_value</span>
</code></pre></div></div>
<p>Now we are ready to calculate the new position values using the new velocities, and then the new fitness of each particle.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute new positions using the new velocities
</span><span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">):</span>
<span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">+=</span> <span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span>
</code></pre></div></div>
<p>And again, to keep the position values within bounds, we use the following lines of code, as before:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o"><</span> <span class="n">min_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">min_value</span>
<span class="k">elif</span> <span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">></span> <span class="n">max_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_value</span>
</code></pre></div></div>
<p>Finally, we are ready to calculate the new fitness value using the objective function. As the following code snippet shows, we plug the two position values into the objective function; they correspond to the $ x $ and $ y $ values in the original equation $ x-y+7 $.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute the fitness of the new positions
</span><span class="n">particle</span><span class="p">.</span><span class="n">fitness</span> <span class="o">=</span> <span class="n">Fitness</span><span class="p">(</span><span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">Fitness</span><span class="p">(</span><span class="n">positions</span><span class="p">):</span>
<span class="c1"># x - y + 7
</span> <span class="k">return</span> <span class="n">positions</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">positions</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">7</span>
</code></pre></div></div>
<p>At the end of each iteration, we evaluate the newly calculated fitness value and, if it is better, use it for two kinds of updates: first, updating the best fitness found by the particle we are moving, and second, updating the best fitness found by any particle of the swarm. Remember that the whole point of using PSO is to find the values of $ x $ and $ y $ that minimize the value of the whole function. Therefore, the best solution to the problem is $ -100 - (+100) + 7 = -193 $, and PSO is able to find the correct solution by the end of the iterations.</p>
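<p>Concretely, that end-of-iteration bookkeeping can be sketched as follows, mirroring the variable names used above (a minimal sketch; the full code linked below is the authoritative version):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># runs once per particle at the end of each iteration;
# "better" means smaller here, since we are minimizing
if particle.fitness < particle.best_particle_fitness:
    # best solution this particle has found so far
    particle.best_particle_fitness = particle.fitness
    particle.best_particle_positions = list(particle.positions)
if particle.fitness < best_swarm_fitness:
    # best solution found by any particle of the swarm
    best_swarm_fitness = particle.fitness
    best_swarm_positions = list(particle.positions)
</code></pre></div></div>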
<p>You can also download the <a href="https://gist.github.com/halolimat/d5e684d4ff92cd9d89f0aa2d91f3a686">full code</a> and play with it yourself. And with that, I finish this post. I hope you learned something new; if you have any questions, don’t hesitate to contact me. Happy learning!</p>
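<p>For reference, here is one way the snippets above could be assembled into a single runnable script. This is a sketch under the assumptions made in this post (fixed inertia weight, velocities and positions clamped to the same range); the gist linked above remains the authoritative version:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
import sys

# problem setup and PSO parameters (same values as above)
number_of_variables = 2          # x and y
min_value, max_value = -100, 100
number_of_particles = 10
number_of_iterations = 2000
w, c1, c2 = 0.729, 1.49, 1.49

def Fitness(positions):
    # the objective function: x - y + 7
    return positions[0] - positions[1] + 7

def clamp(value):
    # keep a velocity or position inside [min_value, max_value]
    return max(min_value, min(max_value, value))

class Particle:
    def __init__(self):
        span = max_value - min_value
        self.positions = [span * random.random() + min_value
                          for _ in range(number_of_variables)]
        self.velocities = [span * random.random() + min_value
                           for _ in range(number_of_variables)]
        self.fitness = Fitness(self.positions)
        self.best_particle_positions = list(self.positions)
        self.best_particle_fitness = self.fitness

swarm = [Particle() for _ in range(number_of_particles)]

# find the initial best of the swarm
best_swarm_fitness = sys.float_info.max
best_swarm_positions = [0.0] * number_of_variables
for particle in swarm:
    if particle.fitness < best_swarm_fitness:
        best_swarm_fitness = particle.fitness
        best_swarm_positions = list(particle.positions)

for _ in range(number_of_iterations):
    for particle in swarm:
        for v in range(number_of_variables):
            r1, r2 = random.random(), random.random()
            particle.velocities[v] = clamp(
                w * particle.velocities[v]
                + c1 * r1 * (particle.best_particle_positions[v] - particle.positions[v])
                + c2 * r2 * (best_swarm_positions[v] - particle.positions[v]))
            particle.positions[v] = clamp(particle.positions[v] + particle.velocities[v])
        particle.fitness = Fitness(particle.positions)
        if particle.fitness < particle.best_particle_fitness:
            particle.best_particle_fitness = particle.fitness
            particle.best_particle_positions = list(particle.positions)
        if particle.fitness < best_swarm_fitness:
            best_swarm_fitness = particle.fitness
            best_swarm_positions = list(particle.positions)

print(best_swarm_fitness, best_swarm_positions)  # should approach -193 at (-100, 100)
</code></pre></div></div>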
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>

<h1>Heuristics: the art of good guessing</h1>

<p>In computer science, we find solutions to problems, and one of the tools we use is the algorithm. Algorithms are procedures that a computer follows and executes. However, an algorithm is not always the best way to solve certain very complex problems, problems in which a partial or approximate solution would suffice.</p>
<p>One of the things that I love about computer science is how we use and learn from everything around us and from all fields of study and disciplines. (You can learn more about Computer Science applied to life in a talk by Tom Griffiths titled <a href="https://www.youtube.com/watch?v=lOhL-XUQPFE">The Computer Science of Human Decision Making</a>.) In this article, however, I will explain a very simple concept we use to solve very complex problems. It’s an idea inspired by the social behavior of animals, such as fish that swim in a school or birds that fly in a flock. And it doesn’t involve an algorithm.</p>
<p>What is good enough? Context is very important when answering that question. For example, if two companies manufacture scales, the first might make scales to weigh highway trucks and the second might make scales to weigh gems. The difference in the required accuracy is very clear: the digital jewelry scale should measure weights of even less than 0.1 grams, while it is enough for the truck scale to show weights in tons.</p>
<p>The other concern when solving a problem is the time needed to solve it. If the truck scale mentioned before takes half an hour to give me the weight of the truck, I would rather find a different manufacturer whose product gives faster answers. Similarly, when writing a piece of software to solve a problem, we care about both the accuracy and the time the program takes to find solutions. Therefore, we carefully design algorithms and try to find creative ways to solve problems.</p>
<p>In computer science, a method that solves a problem well enough, without any guarantee of a perfect solution, is called a heuristic. These kinds of methods are used to answer the “what is good enough?” question in a reasonable amount of time. If finding the perfect solution takes a year and costs <span>$</span>1,000, then a less accurate but good enough solution that takes a week and costs <span>$</span>200 may be the perfect solution given all the constraints.</p>
<p>The mathematician <a href="http://link.hussein.space/georg3e88">George Polya</a> [1887-1985], who is considered the father of heuristics, described heuristics as “the art of good guessing”. Guessing what might be a good enough solution to a problem using heuristics is considered part of Artificial Intelligence: the computer actually thinks for itself and finds a good solution, rather than mechanically trying all of the possible combinations.</p>
<p><a href="http://link.hussein.space/douglfc46">Douglas Merrill</a> once said, “All of us is smarter than any of us”. His saying is very evident in the heuristic method designed by Kennedy and Eberhart [1] called Particle Swarm Optimization (PSO). The method is one of the most successful and famous heuristics we use to find approximate “good enough” solutions for very complex and costly problems. PSO is indeed inspired by the fish schooling and birds flocking and depends on the collective intelligence and the high level of collaboration of the swarm. To form a swarm, PSO creates multiple solutions to the problem where each solution (called a particle) represents a fish in the school or a bird in the flock. PSO moves the particles in the problem space by making them follow the one with the best solution, similar to a school or flock where a bird or a fish follow the one closer to the area where the food is. Therefore, PSO is one of the methods that constitutes what is called swarm intelligence systems.</p>
<p><img style="float: right; width: 60%;" src="https://hussein.space/assets/images/heuristics/heuristics_takishi_castle.png" /></p>
<p>Now, let’s use the analogy of the <a href="http://link.hussein.space/takese87f">Takeshi’s Castle</a> Knock Knock game (<a href="http://link.hussein.space/takes8ed7">short clip</a>) to explain how PSO works. In the game, contenders run through consecutive walls of doors to arrive at an end point, which is why the game is called “The Wall to Freedom”. Each wall has a set of fake and real doors, and a contender needs to find a real door to proceed (the full description of the game can be found on Wikipedia at <a href="http://link.hussein.space/listo8bbd">this link</a>). Contenders learn from each other (like a swarm) by following the one who found the real door. If only one contender played the game, crossing all the walls would take many times longer, since the contender could learn only from his or her own attempts. Furthermore, since the best solution to the game is to reach the final point by crossing all four walls, a less optimal solution is to cross three or fewer walls. Consequently, the quality of a solution is judged by the number of walls crossed: the higher the number, the better the solution.</p>
<p>We use PSO to increase the chance of finding better solutions in less time, since in many cases we cannot afford to wait a long time for the best solution. The swarm allows us to investigate multiple solutions to the problem at the same time and to judge where to go next in the problem space. Now, let’s take a very simple example of a numeric minimization problem.</p>
<p>Suppose that you have the function $ f(x,y)=x-y+7 $, and I ask you for values of $ x $ and $ y $ that minimize the output of the function. If the range of possible values is between $ -100 $ and $ +100 $ for both variables, you are obviously going to answer $ -100 $ for $ x $ and $ +100 $ for $ y $, which makes the minimum possible value $ -193 $. However, it is not as straightforward for a computer. Computers execute algorithms in steps, and in those steps we change the values of $ x $ and $ y $ and then check the quality of the solution; more about that in David J. Malan’s talk titled “<a href="http://link.hussein.space/whats81c7">What’s an algorithm?</a>”.</p>
<p>Imagine that we test every possible solution for this problem, making $ x=-100 $ and $ y=-100 $, then $ x=-99 $ and $ y=-100 $, until we reach $ x=100 $ and $ y=100 $, keeping track of the quality of the solution each time we vary the values of $ x $ and $ y $. By the end of the algorithm, we would have tried all the possible combinations, which is about $ 200×200=40,000 $ different solutions. This is the kind of problem we call a toy (simple) problem in computer science. Now, imagine a more complex problem involving $ 13 $ variables and the same range of values between $ -100 $ and $ +100 $. Can you guess how many combinations we have here? We have $ 200^{13} $ combinations, which is around $ 8 $ with $ 29 $ zeros after it.</p>
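<p>To make the brute-force idea concrete, here is a small Python sketch (illustrative only, not from the paper) that enumerates every combination for the toy problem:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># exhaustive search over the toy problem f(x, y) = x - y + 7
best_value, best_xy = None, None
for x in range(-100, 101):
    for y in range(-100, 101):
        value = x - y + 7
        if best_value is None or value < best_value:
            best_value, best_xy = value, (x, y)
print(best_value, best_xy)  # -193 at (-100, 100)
</code></pre></div></div>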
<p>Accordingly, to find the solution for that larger example on the fastest desktop we have now, with an Intel i7 processor, the program would finish running after around $ 108 $ billion years (around $ 8 $ times the age of the universe). In other words: no thanks! We don’t have enough time to wait. Now a good enough solution doesn’t sound that bad after all.</p>
<p>Now, for simplicity, I will explain how PSO works on the toy problem. PSO would start by creating, for example, $ 100 $ particles, each getting random values within the given range. For example, Particle-1 might get $ x=5, y=-90 $, which gives an answer of $ 102 $; Particle-2 might get $ x=-90, y=10 $, which gives $ -93 $; and so on. Obviously, the answer of Particle-2 is much better, so the idea is for all the other particles to learn from Particle-2 by trying random numbers close to the ones Particle-2 used. In effect, the particles learn that it is wiser to use negative values for $ x $ and positive ones for $ y $.</p>
<p>Since PSO is a heuristic method with no guarantee of a perfect solution, we can stop running the algorithm at any given point in time and say: OK, what I found so far is good enough for me. In our previous work [2], however, we used PSO in combination with other methods and were able to decrease the running time by about $ 49\% $ compared to the baseline method, without compromising the quality of the solution. That was only possible because the swarm allowed us to reflect the idea of good guessing, making PSO the Leonardo da Vinci of computer algorithms.</p>
<p>Interested in learning how to code that solution in Python? Then you can find the step-by-step tutorial explaining how to implement that in the following <a href="https://hussein.space/2016/PSO/">blog post</a>.</p>
<h2 id="references">References:</h2>
<p>[1] Eberhart, Russ C., and James Kennedy. “A new optimizer using particle swarm theory.” In Proceedings of the sixth international symposium on micro machine and human science, vol. 1, pp. 39-43. 1995.</p>
<p>[2] Al-Olimat, H. S., Alam, M., Green, R., & Lee, J. K. (2015). Cloudlet Scheduling with Particle Swarm Optimization. In 2015 Fifth International Conference on Communication Systems and Network Technologies (pp. 991–995). <a href="http://link.hussein.space/ieeex58ac">http://link.hussein.space/ieeex58ac</a></p>
<p>[3] Bird Flock. Digital image. Wikimedia.org, n.d. Web. 26 July 2016. <a href="http://link.hussein.space/orgwi96f0">http://link.hussein.space/orgwi96f0</a>.</p>
<p>[4] Djrhythmdave. “Takeshi’s Castle (UK Dub) - Knock Knock.” YouTube. YouTube, 15 May 2015. Web. 26 July 2016. <a href="http://link.hussein.space/takes2c9d">http://link.hussein.space/takes2c9d</a>.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>

<h1>Extracting and Mapping Location Mentions From Texts To The Ground</h1>

<p>The meaning of a social media post can be a function of location. For example, the meaning of “The Main Street bridge is closed” is ambiguous without establishing exactly which bridge is in question (the one in Danville, VA, or the one in Columbus, OH). At the same time, location metadata is sparse, forcing analysis of social media content and context to disambiguate alternative mappings. This article covers some of the persistent challenges in recovering location information from content and context: text normalization, ambiguous location information, geoparsing, and the future steps of our research.</p>
<p>Consider the Tweet in Figure 1. It contains valuable but implicit information for disaster response and flood modeling. Here the user reports the water level during a storm surge. Knowledge-based inference supports the enrichment of this claim to determine that the water level is around 3 meters (the height of a first floor)<a title="The issue of reliability and trustworthiness of the extracted information are relevant to our project but are not discussed here."><sup>*</sup></a>. If we knew the location of Ganapathi colony, this quantitative data could inform a storm surge model to predict the direction of the surge and the danger it might pose.</p>
<blockquote class="twitter-tweet" data-cards="hidden" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ChennaiRainsHelp?src=hash">#ChennaiRainsHelp</a> water hitting first floor. need to evacuate Ganapathi colony. <a href="https://twitter.com/ChennaiRains">@ChennaiRains</a> <a href="https://twitter.com/ndtv">@ndtv</a> <a href="https://t.co/rNvx4i4JVH">pic.twitter.com/rNvx4i4JVH</a></p>— Balaji (@blaji) <a href="https://twitter.com/blaji/status/671942649570398208">December 2, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 1: An example tweet from Twitris campaign of Chennai Flood 2015.</p>
<p>At the Kno.e.sis center, one of the goals of our NSF-funded project (<a href="http://wiki.knoesis.org/index.php/Social_and_Physical_Sensing_Enabled_Decision_Support">Social and Physical Sensing Enabled Decision Support</a>) is to make information available in social media accessible to first responders to prioritize relief efforts. Mapping events to locations, so that ground-based information can be attached to locations on the map, will allow us to achieve this goal. In contrast to physical sensors, the resolution of social media data is a function of population rather than specialized infrastructure. When sensor/IoT coverage is lacking or sensors malfunction, citizen sensing can provide information about ground status to compensate for the missing data.</p>
<p>We collect real-time, event-centric Twitter data to understand social perceptions. During the 2015 Chennai floods, we collected around 508K relevant tweets, and determining locations is an important part of making these tweets more informative. Next, I will discuss how location extraction and mapping are performed, mainly in two steps: toponym extraction and geoparsing.</p>
<h1 id="toponym-extraction">Toponym Extraction</h1>
<p>Toponym extraction is the process of extracting names of places from texts such as street names, points of interest (POI), cities, countries, and so on. There are two traditional ways to extract toponyms from texts: a supervised approach and a gazetteer-matching approach.</p>
<p><strong><em>Supervised approaches</em></strong>. In the supervised approach, we train a model on manually annotated location mentions [1]. Supervised approaches tend to suffer from data sparsity: they require a sufficient amount of annotated data, mainly from the same kind of data source (e.g., microblog text), to enable location detection on similar sources. The gazetteer approach discussed next, however, has its own difficulties that must be solved to extract locations from texts efficiently.</p>
<p><img src="https://lh6.googleusercontent.com/n2zXmKglLvRvvQL5tSDWKuLTBiIJzAVNfVsbz5r1_C4JYPJ5kiurKfq8hxXzrRk5_mzKqqs8yBCTrV7sw-ELF_FGyIug4s2_-HSCtIE-bv6W7OEfzdgLZnWirMP6Lf0nMt2fIa1g" alt="Syntactic Parse tree" /></p>
<p><strong><em>Gazetteer approaches</em></strong>. In the gazetteer approach, we extract location mentions on the fly, without any training dataset. Gazetteer approaches often use syntactic parse trees (for noun phrase extraction), direction and distance markers, gazetteers, dictionaries, and other knowledge bases to extract locations from texts [2-4]. Figure 2 shows part of the parse tree, built using NLTK, of the tweet text in Figure 3. Parsing the tweet text allows us to find noun phrases using NLTK’s cascaded chunk parser, which matches a set of predefined rules against the text. For example, the rule (<code class="language-plaintext highlighter-rouge">VP: {<VB.*><NP|PP|CLAUSE>+$}</code>) allows us to detect and extract the noun phrase (NP) “SRM university”, which follows the preposition (PP) “Near”.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Please evacuate us from area Near SRM university, kattankulathur. No food.. No current.. Nothing is here . <a href="https://twitter.com/hashtag/ChennaiRains?src=hash">#ChennaiRains</a> <a href="https://twitter.com/hashtag/ChennaiRainsHelp?src=hash">#ChennaiRainsHelp</a></p>— Rohit kumar singh (@imRohit09) <a href="https://twitter.com/imRohit09/status/671742154327252992">December 1, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 3: Tweet mentioning the toponyms “SRM university” and “kattankulathur”.</p>
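<p>For the curious, here is a minimal sketch of this kind of cascaded chunking with NLTK. The grammar follows NLTK’s textbook cascaded-chunker example (only the VP rule is quoted from above; the other rules are illustrative), and the tokenizer and POS-tagger models are assumed to be installed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

# illustrative cascaded-chunking grammar; VP is the rule quoted above
grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # noun phrases
  PP: {<IN><NP>}                # prepositional phrases
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # verb phrases
  CLAUSE: {<NP><VP>}
"""
parser = nltk.RegexpParser(grammar, loop=2)

text = "Please evacuate us from area near SRM university"
tagged = nltk.pos_tag(nltk.word_tokenize(text))  # needs punkt + tagger models
print(parser.parse(tagged))
</code></pre></div></div>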
<p>Similarly, direction and distance markers allow us to retrieve the toponyms the markers point at. For example, in Figure 4, the direction marker “south of” points at the toponym “101 Fwy”, which is then added to our list of potential geo-parsable toponym names.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">THIS JUST IN: Power out in <a href="https://twitter.com/hashtag/StudioCity?src=hash">#StudioCity</a> after reports of loud explosion in area south of 101 Fwy between Woodman Ave and Coldwater Canyon.</p>— ABC7 Eyewitness News (@ABC7) <a href="https://twitter.com/ABC7/status/717933755776655361">April 7, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 4: Tweet mentioning a toponym (“101 Fwy”) pointed at by a direction marker</p>
<p>Two challenges arise in toponym extraction: text normalization and ambiguous location information.</p>
<p><strong><em>Text normalization</em></strong>. Text normalization involves subtasks such as abbreviation and acronym expansion and misspelling correction. Figure 5 shows an example of a tweet with such difficulties: the author used “Rd” as an abbreviation of “road”. Moreover, the text “Kilpauk Garden” is incomplete relative to the gazetteer name “Kilpauk, Aspiran Garden Colony”.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">2 ppl need to be evacuated frm #21, New Avadi Rd, Kilpauk Garden (Nr. Zam Zam Bakery) Contact 9841546767 <a href="https://twitter.com/hashtag/ChennaiRainsHelp?src=hash">#ChennaiRainsHelp</a> <a href="https://twitter.com/hashtag/chennairains?src=hash">#chennairains</a></p>— Raj!n! Followers™ (@RajiniFollowers) <a href="https://twitter.com/RajiniFollowers/status/671730645383569408">December 1, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 5: An example tweet with abbreviations (Rd) and incomplete information (Kilpauk Garden).</p>
<p>Locations can also be embedded in hashtags or usernames. For example, both @yankeestadium and #YankeeStadium refer to the location name “Yankee Stadium”. Such location mentions can be extracted using a word segmentation (tokenization) method such as <a href="https://github.com/grantjenks/python-wordsegment">WordSegment</a>, which uses a statistical method to find word boundaries.</p>
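<p>As a small illustration (assuming the package is installed via <code class="language-plaintext highlighter-rouge">pip install wordsegment</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from wordsegment import load, segment

load()  # load the package's word-frequency data
print(segment("YankeeStadium"))  # -> ['yankee', 'stadium']
</code></pre></div></div>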
<p><strong><em>Ambiguous Location Information</em></strong>. Location information is not always explicit. The relative directionality and distance content noted above hints at this problem. Consider the following tweet (Figure 6) as a more challenging example:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Our house in Chennai fully flooded. Everyone evacuated safely. My brother trying to get there from Mumbai...</p>— Karun Chandhok (@karunchandhok) <a href="https://twitter.com/karunchandhok/status/671980121176281088">December 2, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 6: A tweet showing an ambiguous mention of a location.</p>
<p>In this example, the tweet’s author (renowned Indian racing driver Karun Chandhok) is referring to his parents’ house. Ideally, the location of the house could be extracted from a knowledge base, and the extracted toponym “our house” could then be mapped to an absolute location name. This extracted piece of information would then tell us that people are evacuating from his parents’ area.</p>
<h1 id="geoparsing">geoparsing</h1>
<p>Geoparsing contrasts with geocoding; both follow toponym extraction. Geocoding works with unambiguous location references, such as postal addresses, to specify a location on Earth using coordinates (latitude and longitude). Geoparsing is similar but works with ambiguous location references in unstructured texts (such as tweets).</p>
<p>Geoparsing can be performed through a gazetteer-matching process that lets us retrieve all the metadata of the matched location. OpenStreetMap, for example, provides information such as the bounding box, the latitude and longitude, the full address, the class of the location name (<a href="http://wiki.openstreetmap.org/wiki/Map_Features">Map Features</a>), and the full display name of the matched toponym. The information extracted after a successful gazetteer match pinpoints the toponym on the map and attaches the extracted metadata to it. The following map (Figure 7) shows the mapped toponym from the tweet in Figure 5.</p>
<p><img src="https://lh6.googleusercontent.com/zXtdf1ZctavtrdWwlwkbfJW4g0-epqVRHBUbPYOowDiyhPh1liVirCCobP623Tqm8XW1IzbcC3xpknAXffrvspfV7VITZ9Hfvxvbr-d4pH_x9Ihab9Om42N09VHhRhfKRrroTV7L" alt="GoogleMap" />
Fig. 7: Pinpointing the full location name of the extracted toponym from the tweet in Figure 5: “No. 17/7, New Avadi Road, Kilpauk, Aspiran Garden Colony, Kilpauk, Chennai, Tamil Nadu 600010, India”</p>
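<p>As an illustrative sketch, a single toponym can be matched against OpenStreetMap using its public Nominatim search API (the query string here is hypothetical, and Nominatim’s usage policy requires a descriptive User-Agent):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# query OpenStreetMap's Nominatim geocoder for a toponym
resp = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": "New Avadi Road, Kilpauk, Chennai", "format": "json", "limit": 1},
    headers={"User-Agent": "toponym-mapping-demo"},
)
match = resp.json()[0]
# metadata attached to the matched toponym
print(match["display_name"])
print(match["lat"], match["lon"], match["boundingbox"], match["class"])
</code></pre></div></div>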
<p>A typical gazetteer-matching task requires complex text normalization and missing-data restoration. To overcome some of the difficulties posed by Twitter data, we typically use fuzzy text matching during toponym extraction, in addition to the previously discussed text normalization (a small sketch of fuzzy matching is shown below). As for the incompleteness of gazetteers, a combination of one or more additional knowledge bases can be used. An example of such dictionaries is a list of points of interest<a title="Area specific points of interest (for example, in Chennai) are typically businesses, hospitals, shopping malls, etc."><sup>*</sup></a> that can be retrieved from an external data source.</p>
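<p>As one simple, hypothetical way to fuzzy-match a mention against gazetteer entries, Python’s standard <code class="language-plaintext highlighter-rouge">difflib</code> can score approximate matches; production systems would use more robust matchers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import difflib

gazetteer = ["Kilpauk, Aspiran Garden Colony", "New Avadi Road", "Anna Nagar"]
mention = "Kilpauk Garden"  # the incomplete mention from Figure 5

# returns gazetteer entries whose similarity ratio exceeds the cutoff
print(difflib.get_close_matches(mention, gazetteer, n=1, cutoff=0.4))
# -> ['Kilpauk, Aspiran Garden Colony']
</code></pre></div></div>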
<p>Our research is also concerned with the problem of disambiguation during geoparsing. If a toponym name has many records in the gazetteer, the method should reasonably disambiguate which location the tweet was referring to. This includes the whole-part relationship (i.e., which section of a road, or which campus of a university). Using the context provided by the text, as in Figure 4, where the toponym “101 Fwy” is supposed to be “between Woodman Ave and Coldwater Canyon”, can help tremendously in solving such problems. Our research is currently investigating such problems and possible solutions.</p>
<h1 id="updates">Updates:</h1>
<ul>
<li>In 2018, we published a <a href="https://hussein.space/publications/LNEx/">gazetteer-based approach (LNEx)</a> that solves the problem of location extraction and linking with an average F1-score of 81%.</li>
<li>In 2019, we submitted a paper on geoparsing spatial expressions like “near” and “between” which follows the recognition of atomic toponyms extracted by our tool LNEx. <a href="https://www.researchgate.net/publication/333718820_Towards_Geocoding_Spatial_Expressions?_sg=iymsj3oR-gaWePrmRJqFSHBteGCyroNTlana28lCzzloLUjYa5Lgj_71DRbQL2tf7FtQRgwWGB7AbA.-yUO0IGJMEBFHOkaCjhUx2MClmc7Z5X0Wtb5PPAdg8WiQFElsOP6GGAC64KBNSrtZl5hQudiW0YuQCaRnPP0kw&_sgd%5Bnc%5D=3&_sgd%5Bncwor%5D=0">Link to paper</a></li>
</ul>
<h2 id="references">References:</h2>
<p>[1] Lingad, John, Sarvnaz Karimi, and Jie Yin. “Location extraction from disaster-related microblogs.” In Proceedings of the 22nd international conference on World Wide Web companion, pp. 1017-1020. International World Wide Web Conferences Steering Committee, 2013.</p>
<p>[2] Gelernter, Judith, and Shilpa Balaji. “An algorithm for local geoparsing of microtext.” GeoInformatica 17, no. 4 (2013): 635-667.</p>
<p>[3] Shervin Malmasi, Mark Dras. “Location Mention Detection in Tweets and Microblogs”. Computational Linguistics. Volume 593 of the series Communications in Computer and Information Science pp 123-134. Springer February 2016.</p>
<p>[4] Middleton, Stuart E., Lee Middleton, and Stefano Modafferi. “Real-time crisis mapping of natural disasters using social media.” Intelligent Systems, IEEE 29, no. 2 (2014): 9-17.</p>