<h1>ArabicNLP Datasets</h1>

<table>
<thead>
<tr>
      <th>Category</th>
<th>URL</th>
</tr>
</thead>
<tbody>
<tr>
<td>Papers</td>
<td>https://arxiv.org/ftp/arxiv/papers/1702/1702.07835.pdf</td>
</tr>
<tr>
<td>Sentiment Analysis</td>
<td>http://archive.ics.uci.edu/ml/datasets/Twitter+Data+set+for+Arabic+Sentiment+Analysis</td>
</tr>
<tr>
<td>Classification</td>
<td>http://archive.ics.uci.edu/ml/datasets/Opinion+Corpus+for+Lebanese+Arabic+Reviews+%28OCLAR%29</td>
</tr>
<tr>
<td>Wikipedia</td>
<td>https://archive.org/details/arwiki-20190201</td>
</tr>
<tr>
<td>Multiple</td>
<td>https://lionbridge.ai/datasets/best-arabic-datasets-for-machine-learning/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/ibnmalik/golden-corpus-arabic/tree/develop/core</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/linuxscout/tashkeela2</td>
</tr>
<tr>
<td>Diacritization</td>
<td>https://github.com/AliOsm/arabic-text-diacritization/tree/master/dataset</td>
</tr>
<tr>
<td> </td>
<td>http://tanzil.net/download</td>
</tr>
<tr>
<td> </td>
<td>https://www.kaggle.com/datasets/linuxscout/tashkeela?resource=download</td>
</tr>
<tr>
<td>Crawls</td>
<td>https://traces1.inria.fr/oscar/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/alisafaya/Arabic-BERT</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/mohataher/arabic_big_corpus</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/aosaimy/riyadh-corpus-collection</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/anastaw/Arabic-Wikipedia-Corpus/blob/master/Wikipedia-Corpus-30-08-10.sql.gz</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/antcorpus/RSSCrawlerArabicCorpus</td>
</tr>
<tr>
<td> </td>
<td>BBC Crawl: https://github.com/motazsaad/bbc-crawler</td>
</tr>
<tr>
<td>News</td>
<td>https://github.com/parallelfold/SaudiNewsNet</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/motazsaad/Arabic-News</td>
</tr>
<tr>
<td>ArabicWeb16</td>
<td>Labeled Dataset: https://sites.google.com/view/arabicweb16/download/labelled-datasets?authuser=0</td>
</tr>
<tr>
<td> </td>
<td>Sample big: https://sites.google.com/view/arabicweb16/getting-started?authuser=0 https://drive.google.com/drive/folders/0B6P2zR7VKiV4SWdITFlXcmxObWM</td>
</tr>
<tr>
<td>Raw</td>
<td>https://github.com/Islamicate-DH/arabicCorpus</td>
</tr>
<tr>
<td>NER</td>
<td>https://github.com/EmnamoR/Arabic-named-entity-recognition</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/RamziSalah/Classical-Arabic-Named-Entity-Recognition-Corpus</td>
</tr>
<tr>
<td> </td>
<td>https://www.cs.cmu.edu/~ark/ArabicNER/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/juand-r/entity-recognition-datasets</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/oudalab/Arabic-NER</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/EmnamoR/Arabic-named-entity-recognition/tree/master/</td>
</tr>
<tr>
<td>Keyphrase</td>
<td>https://github.com/ailab-uniud/akec</td>
</tr>
<tr>
<td>Sentiment Analysis</td>
<td>https://github.com/nora-twairesh/AraSenti</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/almoslmi/masc</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/marwanalomari/Sentiment-Classifier-Logistic-Regression-for-Arabic-Services-Reviews-in-Lebanon</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/komari6/Arabic-twitter-corpus-AJGT</td>
</tr>
<tr>
<td> </td>
<td>https://tahatobaili.github.io/project-rbz/</td>
</tr>
<tr>
<td>Speech data</td>
<td>http://www.cs.stir.ac.uk/~lss/arabic/</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/Anwarvic/Arabic-Speech-Recognition</td>
</tr>
<tr>
<td>Tashkeel</td>
<td>https://github.com/Anwarvic/Tashkeela-Model</td>
</tr>
<tr>
<td>Speech to text</td>
<td>https://github.com/motazsaad/jsc-news-broadcast</td>
</tr>
<tr>
<td>Misspellings</td>
<td>https://github.com/linuxscout/aghlat</td>
</tr>
<tr>
<td>Arabic/English Translation</td>
<td>https://github.com/meedan/news-memory</td>
</tr>
<tr>
<td>Poetry</td>
<td>https://github.com/d7eame/Matn</td>
</tr>
<tr>
<td>WordEmbeddings</td>
<td>http://mazajak.inf.ed.ac.uk:8000/</td>
</tr>
<tr>
<td>Stories</td>
<td>https://github.com/motazsaad/Arabic-Stories-Corpus</td>
</tr>
<tr>
<td>Dialects</td>
<td>https://github.com/motazsaad/corpus2json/tree/master/corpora/nizar_arabic_dialects</td>
</tr>
<tr>
<td>Opinion Mining</td>
<td>https://github.com/AhmedObaidi/omcca</td>
</tr>
<tr>
<td>POS + rel</td>
<td>https://github.com/salsama/Arabic-Information-Extraction-Corpus</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/qcri/dialectal_arabic_pos_tagger</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/seloufian/Arabic-PoS-Tagger</td>
</tr>
<tr>
<td>Annotated per nationality</td>
<td>https://github.com/Data-Science-for-Linguists-2020/Arabic-Learner-Corpus-Considerations</td>
</tr>
<tr>
<td>OCR</td>
<td>Digits: https://www.kaggle.com/mloey1/ahdd1</td>
</tr>
<tr>
<td> </td>
<td>Letters: https://www.kaggle.com/mloey1/ahcd1</td>
</tr>
<tr>
<td> </td>
<td>https://cactus.orange-labs.fr/ALIF/download.html</td>
</tr>
<tr>
<td> </td>
<td>http://www.ccse.kfupm.edu.sa/~husni/ArabicOCR/PATS-A02.htm</td>
</tr>
<tr>
<td> </td>
<td>http://kafd.ideas2serve.net/KAFDDownloadOptions.php</td>
</tr>
<tr>
<td> </td>
<td>https://github.com/ainawind27/arabicocr-data</td>
</tr>
<tr>
<td> </td>
<td>http://kitab-project.org/</td>
</tr>
<tr>
<td> </td>
<td>https://medium.com/@openiti/openiti-aocp-9802865a6586</td>
</tr>
<tr>
<td> </td>
<td>https://www.rdi-sotoor.com/#/login</td>
</tr>
<tr>
<td> </td>
<td>https://www.primaresearch.org/RASM2019/</td>
</tr>
<tr>
<td> </td>
<td>https://blogs.bl.uk/digital-scholarship/2018/02/8th-century-arabic-scientists-meet-todays-computer-scientists.html</td>
</tr>
<tr>
<td> </td>
<td>https://blogs.bl.uk/digital-scholarship/2018/03/arabic-handwrittten-ocr.html</td>
</tr>
<tr>
<td> </td>
<td>Raw images: https://fromthepage.com/bldigital/arabic-scientific-manuscripts</td>
</tr>
<tr>
<td>Arabic conversation for chatbots</td>
<td>https://www.kaggle.com/ahmedkaramdev/arabic-conversational-dataset</td>
</tr>
<tr>
<td>WordNet</td>
<td>http://compling.hss.ntu.edu.sg/omw/</td>
</tr>
<tr>
<td> </td>
<td>Resources: http://globalwordnet.org/resources/arabic-wordnet/arabic-resources/</td>
</tr>
<tr>
<td> </td>
<td>http://compling.hss.ntu.edu.sg/omw/wns/arb/LICENSE</td>
</tr>
<tr>
<td>Other</td>
<td>https://www.al-fanarmedia.org/2018/11/an-online-arabic-dictionary-makes-its-debut/#.W__YlAjqMNw.twitter</td>
</tr>
<tr>
<td> </td>
<td>Treebank: https://sourceforge.net/projects/arabicsubcats/files/</td>
</tr>
</tbody>
</table>
<h1>Particle Swarm Optimization: Python Tutorial</h1>

<p><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PSO) is one of the most successful and famous population-based <a href="https://en.wikipedia.org/wiki/Metaheuristic">metaheuristics</a>. Its simplicity and performance have made it easy to adapt to many applications, including scheduling (more details in my paper <a href="http://ieeexplore.ieee.org/abstract/document/7280067/">Cloudlet Scheduling with Particle Swarm Optimization</a>) and power consumption optimization in IoT devices (more details in my paper <a href="https://arxiv.org/pdf/1602.02473.pdf">Particle Swarm Optimized Power Consumption of Trilateration</a>).</p>
<p>There are many versions of PSO, such as hybrid ones where PSO is used along with other algorithms (such as <a href="https://en.wikipedia.org/wiki/Simulated_annealing">Simulated Annealing</a>, as in my publication above). In general, though, a pure PSO algorithm is either single- or multi-objective and operates on either a discrete or a continuous space. The objectives of the algorithm are the things PSO tries to find a solution for. For example, PSO might concentrate on reducing the power consumption of a device without taking anything else into consideration, such as the speed of the device. That is why multi-objective versions were developed: to balance the solution. Balancing the work of the algorithm to consider more than one objective is part of the <a href="https://en.wikipedia.org/wiki/Game_theory">Game theory</a> field (if you are curious and want to know more, look at <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibrium</a> and <a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto Optimality</a>).</p>
<p>The original algorithm works on a continuous space, meaning it solves problems such as the numeric minimization problem I mentioned in the <a href="http://hussein.space/2016/heuristics/">previous Heuristics post</a>. PSO also works on a discrete binary space, where the algorithm finds $ 0 $ or $ 1 $ values for the given problem (a simple example can be found in <a href="http://ieeexplore.ieee.org/abstract/document/7280067/">my scheduling paper</a>). But now, let’s go back to our simple minimization problem and try to solve it using PSO. <em>At this point, if you haven’t read my other <a href="http://hussein.space/2016/heuristics/">blog post</a>, please do so to know what I am trying to solve here.</em></p>
<p>PSO starts by creating a swarm of particles where each particle is a possible solution to the problem. Therefore, we need to understand what exactly we are trying to solve and how to map it to the objective function of PSO, which is considered the hardest part when designing the algorithm.</p>
<p>Let’s first define a few global variables needed throughout our program:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># the x and y in our function (x - y + 7) (aka. dimensions)
</span><span class="n">number_of_variables</span> <span class="o">=</span> <span class="mi">2</span>
<span class="c1"># the minimum possible value x or y can take
</span><span class="n">min_value</span> <span class="o">=</span> <span class="o">-</span><span class="mi">100</span>
<span class="c1"># the maximum possible value x or y can take
</span><span class="n">max_value</span> <span class="o">=</span> <span class="mi">100</span>
<span class="c1"># the number of particles in the swarm
</span><span class="n">number_of_particles</span> <span class="o">=</span> <span class="mi">10</span>
<span class="c1"># number of times the algorithm moves each particle in the problem space
</span><span class="n">number_of_iterations</span> <span class="o">=</span> <span class="mi">2000</span>
<span class="n">w</span> <span class="o">=</span> <span class="mf">0.729</span> <span class="c1"># inertia weight
</span><span class="n">c1</span> <span class="o">=</span> <span class="mf">1.49</span> <span class="c1"># cognitive (particle)
</span><span class="n">c2</span> <span class="o">=</span> <span class="mf">1.49</span> <span class="c1"># social (swarm)
</span></code></pre></div></div>
<p>The first step is to create the swarm of particles</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">swarm</span> <span class="o">=</span> <span class="p">[</span><span class="n">Particle</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">,</span> <span class="n">min_value</span><span class="p">,</span> <span class="n">max_value</span><span class="p">)</span>
<span class="k">for</span> <span class="n">__x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_particles</span><span class="p">)]</span>
</code></pre></div></div>
<p>Each <code class="language-plaintext highlighter-rouge">Particle</code> is an <a href="https://en.wikipedia.org/wiki/Abstract_data_type">Abstract Data Type (ADT)</a> defined as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Particle</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">number_of_variables</span><span class="p">,</span> <span class="n">min_value</span><span class="p">,</span> <span class="n">max_value</span><span class="p">):</span>
<span class="c1"># init x and y values
</span> <span class="bp">self</span><span class="p">.</span><span class="n">positions</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">)]</span>
<span class="c1"># init velocities of x and y
</span> <span class="bp">self</span><span class="p">.</span><span class="n">velocities</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">):</span>
<span class="c1"># update x and y positions
</span> <span class="bp">self</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">max_value</span> <span class="o">-</span> <span class="n">min_value</span><span class="p">)</span>
<span class="o">*</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">+</span> <span class="n">min_value</span><span class="p">)</span>
<span class="c1"># update x and y velocities
</span> <span class="bp">self</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">max_value</span> <span class="o">-</span> <span class="n">min_value</span><span class="p">)</span>
<span class="o">*</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">+</span> <span class="n">min_value</span><span class="p">)</span>
<span class="c1"># current fitness after updating the x and y values
</span> <span class="bp">self</span><span class="p">.</span><span class="n">fitness</span> <span class="o">=</span> <span class="n">Fitness</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
<span class="c1"># the current particle positions as the best fitness found yet
</span> <span class="bp">self</span><span class="p">.</span><span class="n">best_particle_positions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
<span class="c1"># the current particle fitness as the best fitness found yet
</span> <span class="bp">self</span><span class="p">.</span><span class="n">best_particle_fitness</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fitness</span>
</code></pre></div></div>
<p>Before explaining what each line means, let me first describe how a particle behaves in the problem space, and then get back to our code snippet above.</p>
<p>As I mentioned before, each particle in the swarm represents a possible solution to the problem. And as I also mentioned in the <a href="http://hussein.space/2016/heuristics/">previous blog post</a>, each particle tries to improve its solution by learning from two sources: its own movement in the problem space and the movement of the other particles of the swarm (by learning from the best solution found by any of the other particles). To watch a simple visualization of PSO, click on the image below:</p>
<p><a href="http://www.youtube.com/embed/_bzRHqmpwvo" target="_blank"><img style="float: center; width: 80%;" src="https://img.youtube.com/vi/_bzRHqmpwvo/hqdefault.jpg" /></a></p>
<p>Now, let’s see how that translates into code. The positions list in the above code snippet represents the current values of the variables in the objective function ($ x $ and $ y $), while the velocities list represents the (artificial) velocity of the particle along each dimension. We first initialize the values to zeros and then update them using random numbers as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span> <span class="n">max_value</span> <span class="o">-</span> <span class="n">min_value</span><span class="p">)</span> <span class="o">*</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">+</span> <span class="n">min_value</span> <span class="p">)</span>
</code></pre></div></div>
<p>Each particle in our swarm keeps track of its current fitness value as well as the best positions and fitness it has found so far. A particle’s fitness is the value obtained by plugging the current positions list into the objective function (in our example problem, $ positions[0] = x $ and $ positions[1] = y $). Notice that during initialization we take the particle’s fitness and positions as the best found yet, because that might well be the case; later, during each iteration of the algorithm, we will check and update that information.</p>
<p>After initializing the swarm, we check all particles, find the best solution found so far, and keep track of it using the two variables <em>best_swarm_positions</em> and <em>best_swarm_fitness</em>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">best_swarm_positions</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">)]</span>
<span class="n">best_swarm_fitness</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">float_info</span><span class="p">.</span><span class="nb">max</span>
<span class="k">for</span> <span class="n">particle</span> <span class="ow">in</span> <span class="n">swarm</span><span class="p">:</span> <span class="c1"># check each particle
</span> <span class="k">if</span> <span class="n">particle</span><span class="p">.</span><span class="n">fitness</span> <span class="o"><</span> <span class="n">best_swarm_fitness</span><span class="p">:</span>
<span class="n">best_swarm_fitness</span> <span class="o">=</span> <span class="n">particle</span><span class="p">.</span><span class="n">fitness</span>
<span class="n">best_swarm_positions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
</code></pre></div></div>
<p>Now we are ready to start moving particles in the problem space: we generate new velocities to find new positions ($x$ and $y$ values) and, eventually, new solutions (fitness values). The fitness value of each particle in the swarm is updated during each iteration of the algorithm.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">__x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_iterations</span><span class="p">):</span>
<span class="k">for</span> <span class="n">particle</span> <span class="ow">in</span> <span class="n">swarm</span><span class="p">:</span>
<span class="c1"># start moving/updating particles to calculate new fitness
</span></code></pre></div></div>
<p>Then, inside the nested loops, we start updating the velocities and positions and calculate the new fitness while keeping track of the best fitness of the swarm. But first, let’s start by updating the velocities as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute new velocities for each particle
</span><span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">):</span>
<span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">calculate_new_velocity_value</span><span class="p">(</span><span class="n">particle</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
</code></pre></div></div>
<p>For each variable in the objective function, we calculate a new velocity that will later help calculate a new set of positions. That is done by calling the <strong>calculate_new_velocity_value()</strong> function, passing it the current particle and the variable index:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate a new velocity for one variable
</span><span class="k">def</span> <span class="nf">calculate_new_velocity_value</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">v</span><span class="p">):</span>
<span class="c1"># generate random numbers
</span> <span class="n">r1</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
<span class="n">r2</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
<span class="c1"># the learning rate part
</span> <span class="n">part_1</span> <span class="o">=</span> <span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">p</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">])</span>
<span class="c1"># the cognitive part - learning from itself
</span> <span class="n">part_2</span> <span class="o">=</span> <span class="p">(</span><span class="n">c1</span> <span class="o">*</span> <span class="n">r1</span> <span class="o">*</span>
<span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">best_particle_positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">-</span> <span class="n">p</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]))</span>
<span class="c1"># the social part - learning from others
</span> <span class="n">part_3</span> <span class="o">=</span> <span class="p">(</span><span class="n">c2</span> <span class="o">*</span> <span class="n">r2</span> <span class="o">*</span>
<span class="p">(</span><span class="n">best_swarm_positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">-</span> <span class="n">p</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]))</span>
<span class="n">new_velocity</span> <span class="o">=</span> <span class="n">part_1</span> <span class="o">+</span> <span class="n">part_2</span> <span class="o">+</span> <span class="n">part_3</span>
<span class="k">return</span> <span class="n">new_velocity</span>
</code></pre></div></div>
<p>As shown in the above code snippet, the new velocity is the sum of the following three parts:</p>
<h2 id="the-learning-rate-part">The learning rate part:</h2>
<ul>
  <li>The product of the inertia weight parameter ($w$) and the particle’s current velocity.</li>
  <li>The inertia weight influences the convergence of the algorithm and the exploration of its particles; a well-chosen inertia weight is therefore very important to the quality of the solution found by PSO. A higher inertia weight means bigger steps in the problem space (in other words, higher velocities). There are many types of inertia weights, but in this example we use a fixed (static) inertia weight, which does not change throughout the iterations. To learn about other kinds of inertia weights, read section 2.3.4 in my <a href="https://etd.ohiolink.edu/ap/10?0::NO:10:P10_ACCESSION_NUM:toledo1403922600">master’s thesis</a>. Also, check how simulated annealing helped PSO achieve better results in my paper (<a href="http://ieeexplore.ieee.org/abstract/document/7280067/">Cloudlet Scheduling with Particle Swarm Optimization</a>).</li>
</ul>
<h2 id="the-cognitive-part">The cognitive part:</h2>
<ul>
  <li>The product of the constant <em>c1</em>, the random number <em>r1</em>, and the difference between the position value corresponding to the best fitness found by the particle and its current position value.</li>
<li>The overall idea of this part of the equation is to represent the cognitive (self-learning) part of the particle.</li>
</ul>
<h2 id="the-social-part">The social part:</h2>
<ul>
  <li>The product of the constant <em>c2</em>, the random number <em>r2</em>, and the difference between the position value corresponding to the best fitness found by any particle of the swarm and the particle’s current position value.</li>
<li>The overall idea of this part of the equation is to represent the social ability of the particle (learning from the swarm).</li>
</ul>
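<p>Putting the three parts together, the full velocity-update rule implemented by <strong>calculate_new_velocity_value()</strong> is, for each variable $v$:</p>

$$ velocity_v = \underbrace{w \cdot velocity_v}_{\text{inertia}} + \underbrace{c_1 r_1 \left(pbest_v - position_v\right)}_{\text{cognitive}} + \underbrace{c_2 r_2 \left(gbest_v - position_v\right)}_{\text{social}} $$

<p>where $pbest$ holds the particle’s own best positions (<em>best_particle_positions</em>) and $gbest$ the best positions found by the whole swarm (<em>best_swarm_positions</em>).</p>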
<p>To keep our velocities within our desired range we use the following few lines of code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o"><</span> <span class="n">min_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">min_value</span>
<span class="k">elif</span> <span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">></span> <span class="n">max_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_value</span>
</code></pre></div></div>
<p>Now we are ready to calculate the new position values using the new velocities, and then the new fitness of each particle.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute new positions using the new velocities
</span><span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_variables</span><span class="p">):</span>
<span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">+=</span> <span class="n">particle</span><span class="p">.</span><span class="n">velocities</span><span class="p">[</span><span class="n">v</span><span class="p">]</span>
</code></pre></div></div>
<p>And again, to keep the position values within bounds, we use the following lines of code, as before:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o"><</span> <span class="n">min_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">min_value</span>
<span class="k">elif</span> <span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">></span> <span class="n">max_value</span><span class="p">:</span>
<span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">[</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">max_value</span>
</code></pre></div></div>
<p>Finally, we are ready to calculate the new fitness value using the objective function. As the following code snippet shows, we plug the two position values into the objective function; they correspond to the $ x $ and $ y $ values in the original equation $ x-y+7 $.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute the fitness of the new positions
</span><span class="n">particle</span><span class="p">.</span><span class="n">fitness</span> <span class="o">=</span> <span class="n">Fitness</span><span class="p">(</span><span class="n">particle</span><span class="p">.</span><span class="n">positions</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">Fitness</span><span class="p">(</span><span class="n">positions</span><span class="p">):</span>
<span class="c1"># x - y + 7
</span> <span class="k">return</span> <span class="n">positions</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">positions</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">7</span>
</code></pre></div></div>
<p>At the end of each iteration, we evaluate the newly calculated fitness value and, if it is better, use it for two kinds of updates: first, updating the best fitness found by the particle we are moving, and second, updating the best fitness found by any particle of the swarm. Remember that the whole point of using PSO is to find the values of $ x $ and $ y $ that minimize the value of the whole function. Therefore, the best solution to the problem is $ -100 - (+100) + 7 = -193 $, and PSO is able to find the correct solution by the end of the iterations.</p>
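<p>Concretely, that end-of-iteration bookkeeping can be sketched as follows, mirroring the variable names used above (a minimal sketch; the full code linked below is the authoritative version):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># runs once per particle at the end of each iteration;
# "better" means smaller here, since we are minimizing
if particle.fitness < particle.best_particle_fitness:
    # best solution this particle has found so far
    particle.best_particle_fitness = particle.fitness
    particle.best_particle_positions = list(particle.positions)
if particle.fitness < best_swarm_fitness:
    # best solution found by any particle of the swarm
    best_swarm_fitness = particle.fitness
    best_swarm_positions = list(particle.positions)
</code></pre></div></div>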
<p>You can also download the <a href="https://gist.github.com/halolimat/d5e684d4ff92cd9d89f0aa2d91f3a686">full code</a> and play with it yourself. And with that, I finish this post. I hope you learned something new; if you have any questions, don’t hesitate to contact me. Happy learning!</p>
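<p>For reference, here is one way the snippets above could be assembled into a single runnable script. This is a sketch under the assumptions made in this post (fixed inertia weight, velocities and positions clamped to the same range); the gist linked above remains the authoritative version:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
import sys

# problem setup and PSO parameters (same values as above)
number_of_variables = 2          # x and y
min_value, max_value = -100, 100
number_of_particles = 10
number_of_iterations = 2000
w, c1, c2 = 0.729, 1.49, 1.49

def Fitness(positions):
    # the objective function: x - y + 7
    return positions[0] - positions[1] + 7

def clamp(value):
    # keep a velocity or position inside [min_value, max_value]
    return max(min_value, min(max_value, value))

class Particle:
    def __init__(self):
        span = max_value - min_value
        self.positions = [span * random.random() + min_value
                          for _ in range(number_of_variables)]
        self.velocities = [span * random.random() + min_value
                           for _ in range(number_of_variables)]
        self.fitness = Fitness(self.positions)
        self.best_particle_positions = list(self.positions)
        self.best_particle_fitness = self.fitness

swarm = [Particle() for _ in range(number_of_particles)]

# find the initial best of the swarm
best_swarm_fitness = sys.float_info.max
best_swarm_positions = [0.0] * number_of_variables
for particle in swarm:
    if particle.fitness < best_swarm_fitness:
        best_swarm_fitness = particle.fitness
        best_swarm_positions = list(particle.positions)

for _ in range(number_of_iterations):
    for particle in swarm:
        for v in range(number_of_variables):
            r1, r2 = random.random(), random.random()
            particle.velocities[v] = clamp(
                w * particle.velocities[v]
                + c1 * r1 * (particle.best_particle_positions[v] - particle.positions[v])
                + c2 * r2 * (best_swarm_positions[v] - particle.positions[v]))
            particle.positions[v] = clamp(particle.positions[v] + particle.velocities[v])
        particle.fitness = Fitness(particle.positions)
        if particle.fitness < particle.best_particle_fitness:
            particle.best_particle_fitness = particle.fitness
            particle.best_particle_positions = list(particle.positions)
        if particle.fitness < best_swarm_fitness:
            best_swarm_fitness = particle.fitness
            best_swarm_positions = list(particle.positions)

print(best_swarm_fitness, best_swarm_positions)  # should approach -193 at (-100, 100)
</code></pre></div></div>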
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>

<h1>Heuristics: the art of good guessing</h1>

<p>In computer science, we find solutions to problems, and one of the tools we use is the algorithm. Algorithms are procedures that a computer follows and executes. However, an algorithm is not always the best way to solve certain very complex problems, problems in which a partial or approximate solution would suffice.</p>
<p>One of the things that I love about computer science is how we use and learn from everything around us and from all fields of study and disciplines. (You can learn more about Computer Science applied to life in a talk by Tom Griffiths titled <a href="https://www.youtube.com/watch?v=lOhL-XUQPFE">The Computer Science of Human Decision Making</a>.) In this article, however, I will explain a very simple concept we use to solve very complex problems. It’s an idea inspired by the social behavior of animals, such as fish that swim in a school or birds that fly in a flock. And it doesn’t involve an algorithm.</p>
<p>What is good enough? Context is very important when answering that question. For example, if two companies manufacture scales, the first might make scales to weigh highway trucks and the second might make scales to weigh gems. The difference in the required accuracy is very clear: the digital jewelry scale should measure weights of even less than 0.1 grams, while it is enough for the truck scale to show weights in tons.</p>
<p>The other concern when solving a problem is the time needed to solve it. If the truck scale mentioned before takes half an hour to give me the weight of the truck, I would rather find a different manufacturer whose product gives faster answers. Similarly, when writing a piece of software to solve a problem, we care about both the accuracy and the time the program takes to find solutions. Therefore, we carefully design algorithms and try to find creative ways to solve problems.</p>
<p>In computer science, a method that solves a problem well enough, without any guarantee of a perfect solution, is called a heuristic. These kinds of methods are used to answer the “what is good enough?” question in a reasonable amount of time. If finding the perfect solution takes a year and costs <span>$</span>1,000, then a less accurate but good enough solution that takes a week and costs <span>$</span>200 may be the perfect solution given all the constraints.</p>
<p>The mathematician <a href="http://link.hussein.space/georg3e88">George Polya</a> [1887-1985], who is considered the father of heuristics, described heuristics as “the art of good guessing”. Guessing what might be a good enough solution to a problem using heuristics is considered part of Artificial Intelligence: the computer actually thinks for itself and finds a good solution, rather than mechanically trying all of the possible combinations.</p>
<p><a href="http://link.hussein.space/douglfc46">Douglas Merrill</a> once said, “All of us is smarter than any of us”. His saying is very evident in the heuristic method designed by Kennedy and Eberhart [1] called Particle Swarm Optimization (PSO). The method is one of the most successful and famous heuristics we use to find approximate “good enough” solutions for very complex and costly problems. PSO is indeed inspired by the fish schooling and birds flocking and depends on the collective intelligence and the high level of collaboration of the swarm. To form a swarm, PSO creates multiple solutions to the problem where each solution (called a particle) represents a fish in the school or a bird in the flock. PSO moves the particles in the problem space by making them follow the one with the best solution, similar to a school or flock where a bird or a fish follow the one closer to the area where the food is. Therefore, PSO is one of the methods that constitutes what is called swarm intelligence systems.</p>
<p><img style="float: right; width: 60%;" src="https://hussein.space/assets/images/heuristics/heuristics_takishi_castle.png" /></p>
<p>Now, let’s use the analogy of the <a href="http://link.hussein.space/takese87f">Takeshi’s Castle</a> Knock Knock game (<a href="http://link.hussein.space/takes8ed7">short clip</a>) to explain how PSO works. In the game, contenders run through consecutive walls of doors to arrive at an end point, which is why the game is called “The Wall to Freedom”. Each wall has a set of fake and real doors, and a contender needs to find a real door to proceed (the full description of the game can be found on Wikipedia at <a href="http://link.hussein.space/listo8bbd">this link</a>). Contenders learn from each other (like a swarm) by following the one who found the real door. If only one contender played the game, crossing all the walls would take many times longer, since the contender could learn only from his or her own attempts. Furthermore, since the best solution to the game is to reach the final point by crossing all four walls, a less optimal solution is to cross three or fewer walls. Consequently, the quality of a solution is judged by the number of walls crossed: the higher the number, the better the solution.</p>
<p>We use PSO to increase the chance of finding better solutions in less time, since in many cases we cannot afford to wait a long time for the best solution. The swarm allows us to investigate multiple solutions to the problem at the same time and to judge where to go next in the problem space. Now, let’s take a very simple example of a numeric minimization problem.</p>
<p>Suppose that you have the function $ f(x,y)=x-y+7 $, and I ask you for values of $ x $ and $ y $ that minimize the output of the function. If the range of possible values is between $ -100 $ and $ +100 $ for both variables, you are obviously going to answer $ -100 $ for $ x $ and $ +100 $ for $ y $, which makes the minimum possible value $ -193 $. However, it is not as straightforward for a computer. Computers execute algorithms in steps, and in those steps we change the values of $ x $ and $ y $ and then check the quality of the solution; more about that in David J. Malan’s talk titled “<a href="http://link.hussein.space/whats81c7">What’s an algorithm?</a>”.</p>
<p>Imagine that we test every possible solution for this problem, making $ x=-100 $ and $ y=-100 $, then $ x=-99 $ and $ y=-100 $, until we reach $ x=100 $ and $ y=100 $, keeping track of the quality of the solution each time we vary the values of $ x $ and $ y $. By the end of the algorithm, we would have tried all the possible combinations, which is about $ 200×200=40,000 $ different solutions. This is the kind of problem we call a toy (simple) problem in computer science. Now, imagine a more complex problem involving $ 13 $ variables and the same range of values between $ -100 $ and $ +100 $. Can you guess how many combinations we have here? We have $ 200^{13} $ combinations, which is around $ 8 $ with $ 29 $ zeros after it.</p>
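<p>To make the brute-force idea concrete, here is a small Python sketch (illustrative only, not from the paper) that enumerates every combination for the toy problem:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># exhaustive search over the toy problem f(x, y) = x - y + 7
best_value, best_xy = None, None
for x in range(-100, 101):
    for y in range(-100, 101):
        value = x - y + 7
        if best_value is None or value < best_value:
            best_value, best_xy = value, (x, y)
print(best_value, best_xy)  # -193 at (-100, 100)
</code></pre></div></div>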
<p>Accordingly, to find the solution for that larger example on the fastest desktop we have now, with an Intel i7 processor, the program would finish running after around $ 108 $ billion years (around $ 8 $ times the age of the universe). In other words: no thanks! We don’t have enough time to wait. Now a good enough solution doesn’t sound that bad after all.</p>
<p>Now, for simplicity, I will explain how PSO works on the toy problem. PSO would start by creating, for example, $ 100 $ particles, each getting random values within the given range. For example, Particle-1 might get $ x=5, y=-90 $, which gives an answer of $ 102 $; Particle-2 might get $ x=-90, y=10 $, which gives $ -93 $; and so on. Obviously, the answer of Particle-2 is much better, so the idea is for all the other particles to learn from Particle-2 by trying random numbers close to the ones Particle-2 used. In effect, the particles learn that it is wiser to use negative values for $ x $ and positive ones for $ y $.</p>
<p>Since PSO is a heuristic method with no guarantee of a perfect solution, we can stop running the algorithm at any given point in time and say: OK, what I found so far is good enough for me. In our previous work [2], however, we used PSO in combination with other methods and were able to decrease the running time by about $ 49\% $ compared to the baseline method, without compromising the quality of the solution. That was only possible because the swarm allowed us to reflect the idea of good guessing, making PSO the Leonardo da Vinci of computer algorithms.</p>
<p>Interested in learning how to code that solution in Python? Then you can find the step-by-step tutorial explaining how to implement that in the following <a href="https://hussein.space/2016/PSO/">blog post</a>.</p>
<h2 id="references">References:</h2>
<p>[1] Eberhart, Russ C., and James Kennedy. “A new optimizer using particle swarm theory.” In Proceedings of the sixth international symposium on micro machine and human science, vol. 1, pp. 39-43. 1995.</p>
<p>[2] Al-Olimat, H. S., Alam, M., Green, R., & Lee, J. K. (2015). Cloudlet Scheduling with Particle Swarm Optimization. In 2015 Fifth International Conference on Communication Systems and Network Technologies (pp. 991–995). <a href="http://link.hussein.space/ieeex58ac">http://link.hussein.space/ieeex58ac</a></p>
<p>[3] Bird Flock. Digital image. Wikimedia.org, n.d. Web. 26 July 2016. <a href="http://link.hussein.space/orgwi96f0">http://link.hussein.space/orgwi96f0</a>.</p>
<p>[4] Djrhythmdave. “Takeshi’s Castle (UK Dub) - Knock Knock.” YouTube. YouTube, 15 May 2015. Web. 26 July 2016. <a href="http://link.hussein.space/takes2c9d">http://link.hussein.space/takes2c9d</a>.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>

<h1>Extracting and Mapping Location Mentions From Texts To The Ground</h1>

<p>The meaning of a social media post can be a function of location. For example, the meaning of “The Main Street bridge is closed” is ambiguous without establishing exactly which bridge is in question (the one in Danville, VA, or the one in Columbus, OH). At the same time, location metadata is sparse, forcing analysis of social media content and context to disambiguate alternative mappings. This article covers some of the persistent challenges in recovering location information from content and context: text normalization, ambiguous location information, geoparsing, and the future steps of our research.</p>
<p>Consider the Tweet in Figure 1. It contains valuable but implicit information for disaster response and flood modeling. Here the user reports the water level during a storm surge. Knowledge-based inference supports the enrichment of this claim to determine that the water level is around 3 meters (the height of a first floor)<a title="The issue of reliability and trustworthiness of the extracted information are relevant to our project but are not discussed here."><sup>*</sup></a>. If we knew the location of Ganapathi colony, this quantitative data could inform a storm surge model to predict the direction of the surge and the danger it might pose.</p>
<blockquote class="twitter-tweet" data-cards="hidden" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ChennaiRainsHelp?src=hash">#ChennaiRainsHelp</a> water hitting first floor. need to evacuate Ganapathi colony. <a href="https://twitter.com/ChennaiRains">@ChennaiRains</a> <a href="https://twitter.com/ndtv">@ndtv</a> <a href="https://t.co/rNvx4i4JVH">pic.twitter.com/rNvx4i4JVH</a></p>— Balaji (@blaji) <a href="https://twitter.com/blaji/status/671942649570398208">December 2, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 1: An example tweet from Twitris campaign of Chennai Flood 2015.</p>
<p>At the Kno.e.sis center, one of the goals of our NSF-funded project (<a href="http://wiki.knoesis.org/index.php/Social_and_Physical_Sensing_Enabled_Decision_Support">Social and Physical Sensing Enabled Decision Support</a>) is to make information available in social media accessible to first responders to prioritize relief efforts. Mapping events to locations, so that ground-based information can be attached to locations on the map, will allow us to achieve this goal. In contrast to physical sensors, the resolution of social media data is a function of population rather than specialized infrastructure. When sensor/IoT coverage is lacking or sensors malfunction, citizen sensing can provide information about ground status to compensate for the missing data.</p>
<p>We collect real-time, event-centric Twitter data to understand social perceptions. During the 2015 Chennai floods, we collected around 508K relevant tweets, and determining locations is an important part of making these tweets more informative. Next, I will discuss how location extraction and mapping are performed, mainly in two steps: toponym extraction and geoparsing.</p>
<h1 id="toponym-extraction">Toponym Extraction</h1>
<p>Toponym extraction is the process of extracting names of places from texts such as street names, points of interest (POI), cities, countries, and so on. There are two traditional ways to extract toponyms from texts: a supervised approach and a gazetteer-matching approach.</p>
<p><strong><em>Supervised approaches</em></strong>. In the supervised approach, we train a model on manually annotated location mentions [1]. Supervised approaches tend to suffer from data sparsity: they require a sufficient amount of annotated data, mainly from the same kind of data source (e.g., microblog text), to enable location detection on similar sources. The gazetteer approach discussed next, however, has its own difficulties that must be solved to extract locations from texts efficiently.</p>
<p><img src="https://lh6.googleusercontent.com/n2zXmKglLvRvvQL5tSDWKuLTBiIJzAVNfVsbz5r1_C4JYPJ5kiurKfq8hxXzrRk5_mzKqqs8yBCTrV7sw-ELF_FGyIug4s2_-HSCtIE-bv6W7OEfzdgLZnWirMP6Lf0nMt2fIa1g" alt="Syntactic Parse tree" /></p>
<p><strong><em>Gazetteer approaches</em></strong>. In the gazetteer approach, we extract location mentions on the fly, without any training dataset. Gazetteer approaches often use syntactic parse trees (for noun phrase extraction), direction and distance markers, gazetteers, dictionaries, and other knowledge bases to extract locations from texts [2-4]. Figure 2 shows part of the parse tree, built using NLTK, of the tweet text in Figure 3. Parsing the tweet text allows us to find noun phrases using NLTK’s cascaded chunk parser, which matches a set of predefined rules against the text. For example, the rule (<code class="language-plaintext highlighter-rouge">VP: {<VB.*><NP|PP|CLAUSE>+$}</code>) allows us to detect and extract the noun phrase (NP) “SRM university”, which follows the preposition (PP) “Near”.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Please evacuate us from area Near SRM university, kattankulathur. No food.. No current.. Nothing is here . <a href="https://twitter.com/hashtag/ChennaiRains?src=hash">#ChennaiRains</a> <a href="https://twitter.com/hashtag/ChennaiRainsHelp?src=hash">#ChennaiRainsHelp</a></p>— Rohit kumar singh (@imRohit09) <a href="https://twitter.com/imRohit09/status/671742154327252992">December 1, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 3: Tweet mentioning the toponyms “SRM university” and “kattankulathur”.</p>
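<p>For the curious, here is a minimal sketch of this kind of cascaded chunking with NLTK. The grammar follows NLTK’s textbook cascaded-chunker example (only the VP rule is quoted from above; the other rules are illustrative), and the tokenizer and POS-tagger models are assumed to be installed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import nltk

# illustrative cascaded-chunking grammar; VP is the rule quoted above
grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # noun phrases
  PP: {<IN><NP>}                # prepositional phrases
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # verb phrases
  CLAUSE: {<NP><VP>}
"""
parser = nltk.RegexpParser(grammar, loop=2)

text = "Please evacuate us from area near SRM university"
tagged = nltk.pos_tag(nltk.word_tokenize(text))  # needs punkt + tagger models
print(parser.parse(tagged))
</code></pre></div></div>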
<p>Similarly, direction and distance markers allow us to retrieve the toponyms the markers point at. For example, in Figure 4, the direction marker “south of” points at the toponym “101 Fwy”, which is then added to our list of potential geo-parsable toponym names.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">THIS JUST IN: Power out in <a href="https://twitter.com/hashtag/StudioCity?src=hash">#StudioCity</a> after reports of loud explosion in area south of 101 Fwy between Woodman Ave and Coldwater Canyon.</p>— ABC7 Eyewitness News (@ABC7) <a href="https://twitter.com/ABC7/status/717933755776655361">April 7, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 4: Tweet mentioning a toponym (“101 Fwy”) pointed at by a direction marker</p>
<p>Two challenges arise in toponym extraction: text normalization and ambiguous location information.</p>
<p><strong><em>Text normalization</em></strong>. Text normalization involves subtasks such as abbreviation and acronym expansion and misspelling correction. Figure 5 shows an example of a tweet with such difficulties: the author used “Rd” as an abbreviation of “road”. Moreover, the text “Kilpauk Garden” is incomplete relative to the gazetteer name “Kilpauk, Aspiran Garden Colony”.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">2 ppl need to be evacuated frm #21, New Avadi Rd, Kilpauk Garden (Nr. Zam Zam Bakery) Contact 9841546767 <a href="https://twitter.com/hashtag/ChennaiRainsHelp?src=hash">#ChennaiRainsHelp</a> <a href="https://twitter.com/hashtag/chennairains?src=hash">#chennairains</a></p>— Raj!n! Followers™ (@RajiniFollowers) <a href="https://twitter.com/RajiniFollowers/status/671730645383569408">December 1, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 5: An example tweet with abbreviations (Rd) and incomplete information (Kilpauk Garden).</p>
<p>Locations can also be embedded in hashtags or usernames. For example, both @yankeestadium and #YankeeStadium refer to the location name “Yankee Stadium”. Such location mentions can be extracted using a word segmentation (tokenization) method such as <a href="https://github.com/grantjenks/python-wordsegment">WordSegment</a>, which uses a statistical method to find word boundaries.</p>
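<p>As a small illustration (assuming the package is installed via <code class="language-plaintext highlighter-rouge">pip install wordsegment</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from wordsegment import load, segment

load()  # load the package's word-frequency data
print(segment("YankeeStadium"))  # -> ['yankee', 'stadium']
</code></pre></div></div>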
<p><strong><em>Ambiguous Location Information</em></strong>. Location information is not always explicit. The relative directionality and distance content noted above hints at this problem. Consider the following tweet (Figure 6) as a more challenging example:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Our house in Chennai fully flooded. Everyone evacuated safely. My brother trying to get there from Mumbai...</p>— Karun Chandhok (@karunchandhok) <a href="https://twitter.com/karunchandhok/status/671980121176281088">December 2, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Fig. 6: A tweet showing an ambiguous mention of a location.</p>
<p>In this example, the tweet’s author (renowned Indian racing driver Karun Chandhok) is referring to his parents’ house. Ideally, the location of the house could be extracted from a knowledge base, and the extracted toponym “our house” could then be mapped to an absolute location name. This extracted piece of information would then tell us that people are evacuating from his parents’ area.</p>
<h1 id="geoparsing">geoparsing</h1>
<p>Geoparsing contrasts with geocoding; both follow toponym extraction. Geocoding works with unambiguous location references, such as postal addresses, to specify a location on Earth using coordinates (latitude and longitude). Geoparsing is similar but works with ambiguous location references in unstructured texts (such as tweets).</p>
<p>Geoparsing can be performed through a gazetteer-matching process that lets us retrieve all the metadata of the matched location. OpenStreetMap, for example, provides information such as the bounding box, the latitude and longitude, the full address, the class of the location name (<a href="http://wiki.openstreetmap.org/wiki/Map_Features">Map Features</a>), and the full display name of the matched toponym. The information extracted after a successful gazetteer match pinpoints the toponym on the map and attaches the extracted metadata to it. The following map (Figure 7) shows the mapped toponym from the tweet in Figure 5.</p>
<p><img src="https://lh6.googleusercontent.com/zXtdf1ZctavtrdWwlwkbfJW4g0-epqVRHBUbPYOowDiyhPh1liVirCCobP623Tqm8XW1IzbcC3xpknAXffrvspfV7VITZ9Hfvxvbr-d4pH_x9Ihab9Om42N09VHhRhfKRrroTV7L" alt="GoogleMap" />
Fig. 7: Pinpointing the full location name of the extracted toponym from the tweet in Figure 5: “No. 17/7, New Avadi Road, Kilpauk, Aspiran Garden Colony, Kilpauk, Chennai, Tamil Nadu 600010, India”</p>
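<p>As an illustrative sketch, a single toponym can be matched against OpenStreetMap using its public Nominatim search API (the query string here is hypothetical, and Nominatim’s usage policy requires a descriptive User-Agent):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# query OpenStreetMap's Nominatim geocoder for a toponym
resp = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": "New Avadi Road, Kilpauk, Chennai", "format": "json", "limit": 1},
    headers={"User-Agent": "toponym-mapping-demo"},
)
match = resp.json()[0]
# metadata attached to the matched toponym
print(match["display_name"])
print(match["lat"], match["lon"], match["boundingbox"], match["class"])
</code></pre></div></div>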
<p>A typical gazetteer-matching task requires complex text normalization and missing-data restoration. To overcome some of the difficulties posed by Twitter data, we typically use fuzzy text matching during toponym extraction, in addition to the previously discussed text normalization (a small sketch of fuzzy matching is shown below). As for the incompleteness of gazetteers, a combination of one or more additional knowledge bases can be used. An example of such dictionaries is a list of points of interest<a title="Area specific points of interest (for example, in Chennai) are typically businesses, hospitals, shopping malls, etc."><sup>*</sup></a> that can be retrieved from an external data source.</p>
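<p>As one simple, hypothetical way to fuzzy-match a mention against gazetteer entries, Python’s standard <code class="language-plaintext highlighter-rouge">difflib</code> can score approximate matches; production systems would use more robust matchers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import difflib

gazetteer = ["Kilpauk, Aspiran Garden Colony", "New Avadi Road", "Anna Nagar"]
mention = "Kilpauk Garden"  # the incomplete mention from Figure 5

# returns gazetteer entries whose similarity ratio exceeds the cutoff
print(difflib.get_close_matches(mention, gazetteer, n=1, cutoff=0.4))
# -> ['Kilpauk, Aspiran Garden Colony']
</code></pre></div></div>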
<p>Our research is also concerned with the problem of disambiguation during geoparsing. If a toponym name has many records in the gazetteer, the method should reasonably disambiguate which location the tweet was referring to. This includes the whole-part relationship (i.e., which section of a road, or which campus of a university). Using the context provided by the text, as in Figure 4, where the toponym “101 Fwy” is supposed to be “between Woodman Ave and Coldwater Canyon”, can help tremendously in solving such problems. Our research is currently investigating such problems and possible solutions.</p>
<h1 id="updates">Updates:</h1>
<ul>
<li>In 2018, we published a <a href="https://hussein.space/publications/LNEx/">gazetteer-based approach (LNEx)</a> that solves the problem of location extraction and linking with an average F1-score of 81%.</li>
<li>In 2019, we submitted a paper on geoparsing spatial expressions like “near” and “between” which follows the recognition of atomic toponyms extracted by our tool LNEx. <a href="https://www.researchgate.net/publication/333718820_Towards_Geocoding_Spatial_Expressions?_sg=iymsj3oR-gaWePrmRJqFSHBteGCyroNTlana28lCzzloLUjYa5Lgj_71DRbQL2tf7FtQRgwWGB7AbA.-yUO0IGJMEBFHOkaCjhUx2MClmc7Z5X0Wtb5PPAdg8WiQFElsOP6GGAC64KBNSrtZl5hQudiW0YuQCaRnPP0kw&_sgd%5Bnc%5D=3&_sgd%5Bncwor%5D=0">Link to paper</a></li>
</ul>
<h2 id="references">References:</h2>
<p>[1] Lingad, John, Sarvnaz Karimi, and Jie Yin. “Location extraction from disaster-related microblogs.” In Proceedings of the 22nd international conference on World Wide Web companion, pp. 1017-1020. International World Wide Web Conferences Steering Committee, 2013.</p>
<p>[2] Gelernter, Judith, and Shilpa Balaji. “An algorithm for local geoparsing of microtext.” GeoInformatica 17, no. 4 (2013): 635-667.</p>
<p>[3] Shervin Malmasi, Mark Dras. “Location Mention Detection in Tweets and Microblogs”. Computational Linguistics. Volume 593 of the series Communications in Computer and Information Science pp 123-134. Springer February 2016.</p>
<p>[4] Middleton, Stuart E., Lee Middleton, and Stefano Modafferi. “Real-time crisis mapping of natural disasters using social media.” Intelligent Systems, IEEE 29, no. 2 (2014): 9-17.</p>