
I imported a dataset that unfortunately did not have any separators defined, neither for columns nor for rows. I looked for an option to define a specific row separator but could not find one applicable to this situation.

df1 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
                  {lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
                  {lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
                  {lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
df2  <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
                  {lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
                  {lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
                  {lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
newdf <- rbind(df1,df2)

This is a model of the data I am currently struggling with. Ideally, the row separator in this case would have to be defined as "},{" and the column separator as ",". I tried substituting this pattern with a tab and defining a different separator, but this either returned an error (tried with separate_rows from tidyr) or simply did nothing.
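To make the target concrete, here is a base-R sketch of the split I am after, on a trimmed two-record stand-in for `df1$V1` (illustration only; the field names and values are taken from the sample above):

```r
# Trimmed stand-in for df1$V1 (two records only)
s <- "{lat:45.493,lng:-76.4886,alt:22400},
  {lat:45.5049,lng:-76.5285,alt:23425}"
recs <- strsplit(s, "\\},\\s*\\{")[[1]]   # "},{" (with any whitespace between) as the row separator
recs <- gsub("[{}]", "", trimws(recs))    # strip the outer braces and indentation
fields <- strsplit(recs, ",")             # "," as the column separator
length(fields)     # 2 records
fields[[1]][1]     # "lat:45.493"
```

This splits the rows, but each field is still a `name:value` string that would need further parsing.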

Hope you guys can help

Thejohn
  • How about `subset(read.csv(text = paste(gsub("[{}]", "\n", newdf$V1), collapse="\n"), header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE), V1 != "")` – akrun Jan 27 '19 at 19:05
  • This worked; the columns are not named and every other row is empty, though no data is lost. Thank you! – Thejohn Jan 27 '19 at 19:55

1 Answer


This looks like incomplete (incorrect) JSON, so I suggest you bring it up to spec and then parse it with known tools. Some problems, easily mitigated:

  1. sqk should have a comma-separator, perhaps a copy/paste issue. This might be generalized as any "number-letter" progression depending on your process. (Edit: your update seems to have resolved this issue, so I'll remove it. If you still need it, I recommend you go with a very literal gsub("([^,])sqk:", "\\1,sqk:", s).)
  2. Labels (e.g., lat, alt, sqk) should all be double-quoted.
  3. Non-numeric data needs to be quoted, specifically the dates.
  4. Exception to 3: null should remain unquoted.
  5. There are multiple "dicts" that need to be within a "list", i.e., from {...},{...} to [{...},{...}].
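For illustration, this is the form those fixes aim for, shown on a hand-trimmed record (not your full data); once the string looks like this, any JSON parser will accept it:

```r
# Hand-written target form after fixes 2-5 (one trimmed record for brevity)
target <- '[{"lat":45.493,"lng":-76.4886,"call":"COFPQ","spd":null,"postime":"2019-01-15 16:10:39"}]'
res <- jsonlite::fromJSON(target)   # a 1-row data.frame; the null becomes NA
</test>
```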

Side note on your data: I read it in with stringsAsFactors=FALSE, since we don't need factors.

fixjson <- function(s) {
  gsub(",+", ",",
       paste(
         gsub('"sqk":([^,]+)', '"sqk":"\\1"',
              gsub("\\s*\\b([A-Za-z]+)\\s*(?=:)", '"\\1"', # note 2
                   gsub('(?<=:)"(-?[0-9.]+|null)"', "\\1", # notes 3, 4
                        gsub("(?<=:)([^,]+)\\b", "\"\\1\"", # quote all data
                             s, perl = TRUE), perl = TRUE), perl = TRUE)),
         collapse = "," )
       )
}
fixjson(df1$V1)
# [1] "{\"lat\":45.493,\"lng\":-76.4886,\"alt\":22400,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":328,\"spd\":null,\"postime\":\"2019-01-15 16:10:39\"},\n                  {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":23425,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n                  {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n                  {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"}"

From here, we use a well-defined json parser (from either jsonlite or RJSONIO, both use similar APIs):

jsonlite::fromJSON(paste("[", fixjson(df1$V1), "]", sep=""))
#       lat      lng   alt  call  icao registration  sqk trak spd             postime
# 1 45.4930 -76.4886 22400 COFPQ C056P       X-VLMP 6232  328  NA 2019-01-15 16:10:39
# 2 45.5049 -76.5285 23425 COFPQ C056P       X-VLMP 6232  321  NA 2019-01-15 16:11:50
# 3 45.5049 -76.5285 24000 COFPQ C056P       X-VLMP 6232  321  NA 2019-01-15 16:11:50
# 4 45.5049 -76.5285 24000 COFPQ C056P       X-VLMP 6232  321  NA 2019-01-15 16:11:50

From here, rbind as needed. (Note that the null literal was translated into R's NA, which is "as it should be" in my opinion.)
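For example, combining both frames (assuming fixjson as defined above, and jsonlite installed):

```r
# Parse each frame's blob, then stack them (columns match, so rbind is safe)
out1 <- jsonlite::fromJSON(paste("[", fixjson(df1$V1), "]", sep = ""))
out2 <- jsonlite::fromJSON(paste("[", fixjson(df2$V1), "]", sep = ""))
newdf <- rbind(out1, out2)   # 8 rows here: 4 records from each frame
```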

Follow-on suggestion: you can use as.POSIXct directly on your postime column; I hope you are certain all of your data are in the same timezone since the field contains no hint.
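A minimal sketch, assuming UTC purely for illustration (substitute whatever timezone your data was recorded in):

```r
# postime carries no timezone hint, so tz must be supplied explicitly
ts <- as.POSIXct("2019-01-15 16:10:39", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
ts
# [1] "2019-01-15 16:10:39 UTC"
```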

Lastly, you mentioned something about non-ASCII characters gumming up the works. My recent edit included a little added robustness for spaces introduced from the use of iconv (e.g., the use of \\s*), so the following might suffice for you:

jsonlite::fromJSON( paste("[", fixjson(iconv(df2$V1, "latin1", "ASCII", sub="")), "]") )

(Use of iconv suggested by https://stackoverflow.com/a/9935242/3358272)

r2evans
  • Yeah, it's an imported HTML source with Google Maps information, with locations of nodes. I'll try your solution asap and let you know how it goes. – Thejohn Jan 27 '19 at 19:20
  • Though minor, I replaced hard-coded double-quotes with the use of `dQuote`, in the (unlikely?) event that some of the data includes an embedded double-quote. There are several ways this can go wrong, so (1) if you're certain that no other fields will be included, you do not need this last edit; (2) if data is ever quoted, this will double-quote it, which is not good but could be guarded against; (3) if something else happens, who knows how this will break. – r2evans Jan 27 '19 at 19:25
  • Just tried it; the original data also includes some mixes of numeric and character classes, such as `icao:C065E0` or `call:CGMPB`. Do you suggest just removing these? Alternatively, I can add them to the dataset? – Thejohn Jan 27 '19 at 19:29
  • Error: lexical error: invalid char in json text. [{“lat”:45.493,“lng”:-76. (right here) ------^ Process finished, though I am left with an error when trying to apply the json line. – Thejohn Jan 27 '19 at 19:41
  • Sounds like an encoding issue. There are several Q/As on SO about encoding issues; I can't look for them at the moment, but if you're still stuck later I'll see what I can find. Perhaps somebody else can help, too ... – r2evans Jan 27 '19 at 20:13
  • Okay, your `"icao:..."` and such suggests the `gsub` needs to look for word-boundaries, not just alphanum. I'll edit a little later. (If you can update the original dataset to include those examples, it would help a little.) – r2evans Jan 27 '19 at 20:16
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/187402/discussion-between-thejohn-and-r2evans). – Thejohn Jan 27 '19 at 23:07
  • @Thejohn, answer updated per chat. Sorry it took so long (to compute, as well as for me to get back here and update the answer). – r2evans Feb 01 '19 at 23:15