Parsing a string - unable to determine how to separate concatenated words

Question

I don't have much code to show because it does not directly relate to a code issue, but I am having trouble parsing text in C++ and I need help. I am unable to find a solution elsewhere.

I have a KML file at this link.

Inside it is text generated by the US National Weather Service. The text is as follows:

Shower activity associated with a tropical wave over the easternCaribbean Sea remains disorganized.  This system is expected to movewest-northwestward with no significant development, producinglocally heavy rainfall over Puerto Rico, Hispaniola, and portions ofthe southeastern Bahamas during the next few days. Over the weekend,conditions could become marginally conducive for development whenthe disturbance moves near Florida and the central and northwesternBahamas.

I am still a novice programmer and I am having trouble with this text. Notice that words like andthunderstorms are run together. In my attempt to parse this text and place a space between the words, I looked for escape sequences that might be causing this, the first being "\n". This did not work.

I can't find any way to separate these words.

I decided to check whether the words were actually joined in the data by searching for one of them and seeing whether the search finds a match:

std::size_t findWord = KML.find("andthunderstorms");  // std::string::npos means "not found"

This returns a valid position rather than std::string::npos, which leads me to believe that there is no weird formatting causing this and that the text is simply delivered that way. The problem is that I find that hard to believe: it does not make sense for a large organization to send out improperly formatted weather data. In addition, I am doing a project in Mapbox using this text, and it does not display the text at all as it is. This usually happens when there is an escape sequence it does not like; it won't load anything. This is why I believe it has something to do with the text itself.

How can I find out what is causing this? I am not asking anyone to write code for me, I just need a place to start.
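One possible starting point, as a minimal sketch: `std::string::find` returns `std::string::npos` when there is no match, and 0 is a valid position, so the result should be compared against `npos` rather than zero. Once a match is found, printing the surrounding bytes with non-printable ones escaped shows whether anything invisible sits at the join. The helper name `dumpAround` and its `context` parameter are made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes around the first occurrence of
// `needle` in `text`, with anything outside printable ASCII written as \xNN.
// Returns an empty string when `needle` does not occur.
std::string dumpAround(const std::string& text, const std::string& needle,
                       std::size_t context = 8)
{
    const std::size_t pos = text.find(needle);
    if (pos == std::string::npos)  // "not found" is npos, not 0
        return {};

    const std::size_t begin = pos > context ? pos - context : 0;
    const std::size_t end =
        std::min(text.size(), pos + needle.size() + context);

    std::string out;
    for (std::size_t i = begin; i < end; ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c >= 0x20 && c < 0x7f) {
            out += static_cast<char>(c);  // printable ASCII, keep as-is
        } else {
            char buf[5];                  // "\xNN" plus terminating NUL
            std::snprintf(buf, sizeof buf, "\\x%02X", c);
            out += buf;
        }
    }
    return out;
}
```

Calling `dumpAround(KML, "andthunderstorms")` and printing the result would show a newline as `\x0A` or a carriage return as `\x0D` if one were hiding near the match. If only ordinary letters appear, the words really are adjacent in the data.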

Put your file somewhere other than Dropbox -- few will create an account just to download your file. Pastebin or the like does not require an account to access such material. You seem to have provided a line in your question that should satisfy the [Minimal, Complete, and Verifiable Example (MCVE)](http://stackoverflow.com/help/mcve) requirement. — David C. Rankin, Jul 31 '19 at 06:49
Thank you. If I may ask, how did you come to that conclusion? If it's able to be found using str.find(), does that positively mean it is not separated by anything? — David, Jul 31 '19 at 06:49
@David I used an editor (Notepad++) that has an option to show all characters in a file, including normally invisible escape sequences etc. If I had to guess I would say that the original text contained line breaks at the points where two words run together, and some automatic process has stripped the line breaks before putting the text into a KML file. — john, Jul 31 '19 at 06:51
Since this is NWS forecast data, there are a limited number of words used. You can build a lookup table of those words that would allow you to identify when they are combined, e.g. `"north", "south", "east", "west", "northern", "southern", "eastern", "western", ... "showers", "storms", "thunderstorms", etc...` When you encounter a word containing any substring from the lookup table, you can iterate to find where in the combined word your known word starts and then separate the two at that point. — David C. Rankin, Jul 31 '19 at 06:54
@David And yes, if str.find() can find a word it positively does mean that word appears exactly as is in the text. — john, Jul 31 '19 at 06:57
@DavidC.Rankin Thank you, I was just wondering how I could build something like that. It may not be foolproof, but it may eliminate much of the problem. — David, Jul 31 '19 at 06:58
The benefit is you will learn some cool weather terms building your lookup table, like `"Orographic lift", "Advection Fog", ....` It is really something you can do on the fly. As you read through your NWS data, store the individual terms in a vector of string. Write it out to a file so it is persistent and usable later on. Then, when you have one of these things, you can load your words file before you start processing the next forecast, etc... — David C. Rankin, Jul 31 '19 at 07:00
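The lookup-table idea from the comments above can be sketched roughly as follows, assuming C++17 and a vocabulary already held in memory. The names `Vocabulary` and `splitJoined` are hypothetical, and the sketch only handles a token made of exactly two known words with no punctuation attached.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical alias: the set of known forecast words (case-sensitive).
using Vocabulary = std::unordered_set<std::string>;

// If `token` is not itself a known word but can be cut into two known
// words, return the two halves; otherwise return std::nullopt.
// The first cut position that works wins.
std::optional<std::pair<std::string, std::string>>
splitJoined(const std::string& token, const Vocabulary& vocab)
{
    if (vocab.count(token))  // already a proper word, leave it alone
        return std::nullopt;

    for (std::size_t cut = 1; cut < token.size(); ++cut) {
        std::string left = token.substr(0, cut);
        std::string right = token.substr(cut);
        if (vocab.count(left) && vocab.count(right))
            return std::make_pair(std::move(left), std::move(right));
    }
    return std::nullopt;  // unknown word, or not a simple two-word join
}
```

With a vocabulary containing `"eastern"` and `"Caribbean"`, `splitJoined("easternCaribbean", vocab)` yields the pair `("eastern", "Caribbean")`. A join such as `weekend,conditions` would need the punctuation stripped or handled separately before the lookup.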
More efficient than just searching any sub-strings again and again would be some state machine or perhaps a [trie](https://en.wikipedia.org/wiki/Trie) based algorithm. More complex to build, though... — Aconcagua, Jul 31 '19 at 07:27
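As a rough illustration of the trie suggestion, the sketch below stores the vocabulary in a trie so that all known words starting at a given position are found in a single walk, then uses a small dynamic-programming pass to split a token into known words. The names `Trie`, `prefixLengths`, and `segment` are invented for this example, and it is one possible design rather than an established library API.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

class Trie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool isWord = false;
    };
    Node root_;

public:
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->isWord = true;
    }

    // Lengths (ascending) of every stored word that begins at text[start].
    std::vector<std::size_t> prefixLengths(const std::string& text,
                                           std::size_t start) const {
        std::vector<std::size_t> lengths;
        const Node* node = &root_;
        for (std::size_t i = start; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->isWord) lengths.push_back(i - start + 1);
        }
        return lengths;
    }
};

// Splits `token` into stored words covering it completely, preferring the
// longest word at each position. Returns an empty vector if no such split
// exists (or if `token` is empty).
std::vector<std::string> segment(const std::string& token, const Trie& trie)
{
    const std::size_t n = token.size();
    std::vector<bool> ok(n + 1, false);       // ok[i]: token[i..] can be split
    std::vector<std::size_t> next(n + 1, 0);  // next[i]: word length chosen at i
    ok[n] = true;

    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t len : trie.prefixLengths(token, i)) {
            if (ok[i + len]) {  // ascending order, so the longest valid one is kept
                ok[i] = true;
                next[i] = len;
            }
        }
    }

    std::vector<std::string> words;
    if (!ok[0]) return words;
    for (std::size_t i = 0; i < n; i += next[i])
        words.push_back(token.substr(i, next[i]));
    return words;
}
```

Unlike a two-way cut, this handles any number of joined words, and a token that is already a single known word comes back as a one-element vector.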
Thanks, everyone, for all the input. @Aconcagua do you think there are libraries meant for this sort of thing? Something like autocomplete on a cell phone. — David, Jul 31 '19 at 07:30
@David I'm not aware of a specific one, but I'm pretty sure something like that exists. You might start at [Wikipedia](https://en.wikipedia.org/wiki/String-searching_algorithm); the answers to [this question](https://stackoverflow.com/q/3183582/1312382) appear quite interesting, too. — Aconcagua, Jul 31 '19 at 07:44
If you consult your favourite search engine with the name of some of the algorithms, you might find some library, too (did so for KMP, first I stumbled upon was [this](https://en.wikibooks.org/wiki/Algorithm_implementation/String_searching/Knuth-Morris-Pratt_pattern_matcher#C++) – if correct or not, I don't know, you need to evaluate yourself...). — Aconcagua, Jul 31 '19 at 07:45
@David what forecast product are you dealing with? Prognostic, TAF, METAR? If you can identify the forecast, you can probably get a vocabulary file (or take a couple dozen forecasts and build a vocabulary file from them). Also, where is your data coming from? It looks like the problem is that whatever software is "joining" the lines isn't smart enough to always ensure a space is placed between the last and first words during line joins. If that is something you are doing, then use a better editor (like Vim, kate/kwrite, emacs, etc...), or at least try another one if the one you are using is the problem. `:)` — David C. Rankin, Jul 31 '19 at 08:15
@DavidC.Rankin I am parsing KMZ files from the National Hurricane Center. I use wget to fetch them, 7zip to unpack the KML, and then I can parse out the polygon coordinates and descriptions. From there I build a .JSON file for use with the web app I am developing. I noticed https://www.nhc.noaa.gov/gtwo.php has the same text as in the KML files, and guess what! There is a line break exactly at the problem points. To save myself the hassle of programming beyond my capacity, I am working on extracting the descriptions from there. I am curious about the sort of data you work with now; I am also in TX. — David, Jul 31 '19 at 08:23
@DavidC.Rankin exactly. I will send an email to the NHC about this, even though it may fall on deaf ears. — David, Jul 31 '19 at 08:40
You would be surprised. I have always found NWS very responsive to issues, be they problems with some of their sites, or their willingness to provide underlying data for analysis. I haven't corresponded with NHC directly, but I wouldn't expect their response to be any different. Good luck. — David C. Rankin, Jul 31 '19 at 12:26
@DavidC.Rankin it's been a while but I did send my email to NHC and they confirmed it was an error. Apparently it has been fixed. — David, Oct 03 '19 at 08:49
Sure `drankinatty` at good old gmail will work too. Will be later tomorrow before I am likely to get to it. — David C. Rankin, Dec 29 '19 at 11:35
