Landfall Prediction



The aim of this exercise was to apply data science and machine learning techniques to the Atlantic hurricane dataset in order to answer the machine learning problem: landfall prediction.

An overview of the steps taken is shown in Fig. 1. First, data wrangling and feature engineering of the dataset were done in Python. ArcGIS was used to visualize the hurricane points on a map and to derive further insights using GIS. Finally, RapidMiner was used to test different models and to apply a Naive Bayes classifier to predict whether hurricane points would make landfall or not.

[Fig. 1. Overview of the workflow]



1. Data Cleaning


The revised Atlantic hurricane dataset (HURDAT2) is a 49,105-row by 22-column table which provides the geolocation, in latitude and longitude (with respective orientation), of hurricanes at 6-hour intervals from genesis to decay. Fig. 2 below summarizes the data type of every column. As the expression "garbage in, garbage out" warns, it is imperative to know what our data contains and to perform the modifications necessary to make the dataset machine learning "ready". Every machine learning process needs a reliable, organized dataset, especially for training.

[Fig. 2. Data types of each column]


#Checking for NaN
The sentinel values -99 and -999 were converted into NaN. Giving them a value of 0 would be misleading, because the model would then learn that these points actually measured 0, which is not the case.
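In pandas, this substitution can be sketched as follows (the column names here are illustrative, not the actual HURDAT2 headers):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for HURDAT2; these column names are illustrative.
df = pd.DataFrame({
    "max_wind": [45, -99, 60],
    "min_pressure": [-999, 1005, 998],
})

# Map the sentinel values -99 and -999 to NaN so the model does not
# mistake "missing" for a real measurement of 0.
df = df.replace({-99: np.nan, -999: np.nan})
```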



2. Feature Engineering


With a fully prepared table, we can further inspect the data and make inferences about what is really going on in every row and column. Relevant columns should be retained or "created" to capture the spatial pattern stored in the historical data.

#Transformation of Coordinate values - Latitude and Longitude
Latitude and longitude values are stored as "objects" (strings). Each was parsed to extract the numeric part and to account for the orientation given by the string component. For latitude, values with an orientation of "N" (North) should be positive (+) and values with "S" (South) negative (-), as referenced from the equator. For longitude, "W" (West) values should be negative (-) and "E" (East) values positive (+), as referenced from the 0-degree meridian.
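A minimal parsing helper, assuming coordinate strings of the form "28.0N" or "94.8W":

```python
def parse_coord(value: str) -> float:
    """Convert a HURDAT2-style coordinate string like '28.0N' or '94.8W'
    into a signed float. 'S' and 'W' orientations become negative."""
    text = value.strip()
    direction = text[-1].upper()
    magnitude = float(text[:-1])
    return -magnitude if direction in ("S", "W") else magnitude
```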

#Remove duplicates
There were 49,105 rows in the original dataset. No duplicates were detected using the duplicated function; based on the unique IDs, every row is unique. Filtering with unique IDs leaves about 38,307 unique dates and times.
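The duplicate check can be sketched in pandas like this (toy data; the column names are hypothetical):

```python
import pandas as pd

# Toy stand-in for the hurricane table.
df = pd.DataFrame({
    "id": ["AL011851", "AL011851", "AL021851"],
    "datetime": ["1851-06-25 00:00", "1851-06-25 06:00", "1851-07-05 12:00"],
})

# Full-row duplicate check (none were found in HURDAT2).
n_duplicates = int(df.duplicated().sum())

# Count of distinct date-time stamps across all storms.
n_unique_datetimes = df["datetime"].nunique()
```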

#Calculating Speed, Bearing, Wind Direction, Wind Strength
To capture the movement and intensity of the hurricane, it is first necessary to compute speed and bearing from the data. Speed is important because it is indicative of both the strength and the movement of the hurricane: strength, because speed varies from the point of genesis to decay; movement, because as the hurricane moves through time, a distance (d) is covered in whatever direction it travels from its previous location.

The formula for speed is distance over time. This is straightforward to compute in our case, since we have the time and location of each hurricane at every point in time. We can compute the distance (d) with the distance formula and divide it by the time elapsed between fixes. The figure below shows my understanding of the components of the speed derivation.

[Figure: Components of the speed derivation]
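Since the coordinates are geographic, the distance term cannot come from the flat-plane distance formula; a haversine great-circle distance is one common choice. A sketch of speed = distance / time over the 6-hour reporting interval:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points,
    using a mean Earth radius of ~6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def speed_kmh(lat1, lon1, lat2, lon2, hours=6.0):
    """Speed = distance / time, using the 6-hour HURDAT2 interval."""
    return haversine_km(lat1, lon1, lat2, lon2) / hours
```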


Bearing, in surveying terms, is the angle from the North-South line (measured clockwise or counter-clockwise) and indicates direction. For this exercise, it is an important factor paired with the speed of the hurricane because it indicates the direction from point 1 to point 2, and so on to point N, on the map. With this data, we can fully define and plot the overall trajectory of the hurricane.

However, this is not straightforward to compute because we are working with latitude and longitude; it is not a simple x-y problem. The curvature of the Earth must be accounted for.
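A standard initial-bearing formula on a sphere, which accounts for this curvature, can be sketched as:

```python
import math

def initial_bearing_deg(lat1: float, lon1: float,
                        lat2: float, lon2: float) -> float:
    """Initial great-circle bearing from point 1 to point 2, in degrees
    clockwise from North, normalized to [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
```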

Following this, the corresponding wind bearing and strengths were also aggregated for low, moderate, and high wind records.

#Correlation
By this point, the important numeric components are complete - location, maximum wind, minimum pressure, speed, bearing, wind direction, and wind bearing.

To measure the strength of the linear relationship between each pair of variables, the Pearson correlation was computed. The figure below presents these correlations as a heat map.

[Figure: Pearson correlation heat map]
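With pandas, the Pearson correlation matrix comes directly from `DataFrame.corr()` (toy synthetic data below stands in for the engineered columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the engineered numeric columns.
df = pd.DataFrame({"max_wind": rng.normal(80, 20, 100)})
# Pressure tends to drop as wind rises, so build a negatively
# correlated column plus noise.
df["min_pressure"] = 1010 - 0.5 * df["max_wind"] + rng.normal(0, 2, 100)
df["speed"] = rng.normal(20, 5, 100)

# Pearson is the default method of DataFrame.corr().
corr = df.corr()

# The heat map itself can then be drawn, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.imshow(corr, cmap="coolwarm"); plt.colorbar(); plt.show()
```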


#Clustering
To detect whether a hurricane will make landfall, the climatology of the hurricane must be investigated, i.e. the point of genesis or starting point, the sea surface temperature, the wind energy, etc. These climatology factors determine the trajectory of hurricanes and the probability of making landfall. To capture this, hurricanes can be clustered into groups. Parasuraman (2019) designated four group categories for hurricanes in the Atlantic.

These groupings were mainly designated by the 70°W meridian and the 20°N parallel. A simple routine was written to add a column and apply the respective grouping to every row of the data frame.
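Such a routine can be sketched as below; the exact numbering of the four quadrants here is my assumption for illustration, not necessarily the assignment used by Parasuraman (2019):

```python
import pandas as pd

def assign_group(lat: float, lon: float) -> int:
    """Assign a climatology group from the 70W meridian and 20N parallel.
    Quadrant numbering is an illustrative assumption."""
    if lat >= 20:
        return 1 if lon <= -70 else 2  # north of 20N: west / east of 70W
    return 3 if lon <= -70 else 4      # south of 20N: west / east of 70W

df = pd.DataFrame({"lat": [25.0, 25.0, 15.0, 15.0],
                   "lon": [-80.0, -50.0, -80.0, -50.0]})
df["group"] = [assign_group(la, lo) for la, lo in zip(df["lat"], df["lon"])]
```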



3. Map Visualization


The updated dataset was visualized in ArcGIS. The figures below show a sample map of every hurricane point and the assigned groupings.


[Figures: Hurricane points and assigned groupings on the map]

#Creating a new feature - Assigning Points on Land
After determining which points are on land (by taking the intersection of the world landmass and the hurricane points), a new column was added to the dataset to indicate whether they are on land (value = 1) or not (value = 0). This was achieved by merging the point files on land and the hurricane point file to join the column of interest.
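The intersection itself was computed in ArcGIS; the subsequent column join can be sketched in pandas (the `point_id` column is hypothetical):

```python
import pandas as pd

# Hurricane points, one row per 6-hour fix (point_id is hypothetical).
points = pd.DataFrame({"point_id": [1, 2, 3, 4]})

# IDs of the points that fell inside the world landmass polygons,
# i.e. the intersection result exported from the GIS step.
on_land = pd.DataFrame({"point_id": [2, 4], "onLand": [1, 1]})

# Left-join, then fill the non-matching points with 0 (= on water).
points = points.merge(on_land, on="point_id", how="left")
points["onLand"] = points["onLand"].fillna(0).astype(int)
```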

The main purpose of this "onLand" column is to provide the "labels" for our machine learning problem: predicting whether the hurricane will make landfall.



4. Applying Machine Learning


RapidMiner was used to solve this classification problem. The software enables data preparation, modeling, data mining, and model applications.

#Auto-modeling evaluation
The Auto Model functionality was very helpful for evaluating my final dataset as input data, and for pre-computing all relevant models for my specific machine learning problem for comparison.

[Figures: RapidMiner Auto Model evaluation results]


#Naive Bayes
Naive Bayes was employed for the prediction: to determine whether the hurricane will make landfall or not. This model is simple and not computationally taxing to implement, because the attributes (apart from the label) are treated as independent of one another, leading to a simpler and faster calculation.

[Figures: Naive Bayes process and results in RapidMiner]


The "onLand" column (1 = on land; 0 = on water) was used to define the labels for the classification machine learning problem.

It is also important to note that every unique row of the dataset was retained, instead of condensing/merging the data into one row per unique ID. This preserves the data for training and allows the model to learn the values of speed and bearing at every point in time. As stated earlier, computing speed and bearing required understanding and implementing the distance formula from point 1 to point 2. By doing so, we have a dataset that is not reliant only on the point of genesis, which makes it more robust.

The data was split 70-30 into training and validation sets. Naive Bayes was employed and presented here because the Decision Tree model already returned 100% classification accuracy on an actual run. Beyond that, I had trouble running the Logistic Model and Generalized Linear Model because they do not accept my numeric labels (0, 1). As shown in Fig. 14, this can normally be overcome in RapidMiner through a function called "Numeric to Polynominal", which converts the 0 and 1 values into "Yes" or "No"; however, the Logistic Model and Generalized Linear Model do not accept its outputs and required a further transformation into Binominal, and even then the problem was not solved. Fast Large Margin was not applied because, although it is the fastest to run, its accuracy was low and its gain negative.
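For comparison, an equivalent Naive Bayes pipeline can be sketched in scikit-learn; the synthetic features below are made-up stand-ins, not the actual HURDAT2 table used in RapidMiner:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# Synthetic stand-in features (e.g. lat, lon): one cluster "near land",
# one "over open water". Centers and scales are invented for illustration.
X_land = rng.normal(loc=[30.0, -85.0], scale=1.0, size=(200, 2))
X_water = rng.normal(loc=[15.0, -50.0], scale=1.0, size=(200, 2))
X = np.vstack([X_land, X_water])
y = np.array([1] * 200 + [0] * 200)  # 1 = on land, 0 = on water

# 70-30 train/validation split, mirroring the RapidMiner setup.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = GaussianNB().fit(X_tr, y_tr)
accuracy = model.score(X_va, y_va)
```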

In the results pane, users can explore the model's classification results, with helpful summaries of the data and the ability to filter and make plots. The resulting accuracy is 99.88%, with a classification error of 0.12%, as reported in Table 2. Looking at the misclassified examples, the majority belong to Groups 1 and 3, which lie farther from land than Group 2 and 4 hurricanes. In total, there were 20 misclassified and 16,388 correctly classified examples (16,408 in total).



5. Conclusion


Data science and machine learning methods were successfully applied to the Atlantic hurricane dataset, through data cleaning, feature engineering, map visualization, and modeling, to solve the machine learning problem: predicting whether a hurricane will make landfall.

Some additional insights from this exercise:

  1. Data wrangling really took the majority of the time, around 80%, leaving around 20% for exploration and modeling.
  2. After cleaning and re-designing the data, the prediction can be done either through software (e.g. RapidMiner) or through coding (e.g. using the sklearn, TensorFlow, or Keras modules). RapidMiner was used for this exercise to get familiarized with the software. In addition, map visualizations of the dataset were done in ArcGIS to easily perform GIS operations (e.g. merging, intersection, map layout) to create another feature (which became the 'label' class).
  3. RapidMiner was very intuitive to use because it offers a step-by-step procedure for applying the models, whether for prediction, clustering, or regression problems. There were also instructions and functionalities that lead the user to understand the results.



References