Split row string into multiple columns in R

Question

I used R and I have this string as a row and I need to split it to be as columns

'id': 1050442590754103297, 'id_str': '1050442590754103297', 'name': 'ام رودينا ', 'screen_name': 'uclkGkQ5', 'location': None, 'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات الخاصه', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1567, 'friends_count': 4019, 'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279, 'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ar', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None

I tried this code is worked but I need to specify the number of columns that I need and also I need to rename the columns at the end, so it is difficult and takes time

d<-str_split_fixed(try$user, ",", 4)

the result I got is and it is without columns name :

'id': 1050442590754103297    'id_str': '1050442590754103297'   'name': 'ام رودينا <U+267F>'

and the fourth column has the rest of the string

'screen_name': 'uclkGkQ5', 'location': None, 'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات الخاصه<U+267F>', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1567, 'friends_count': 4019, 'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279, 'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ar', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None

I need a code that split the rows based on comma and makes the column name the word before (:) as shown here:

 id                         id_str                    name        screen_name     
 1050442590754103297      1050442590754103297       ام رودينا \u267f           uclkGkQ5

and the same for the rest of string hope you understand me and thank you

What are you trying to accomplish with the no-op `try$user<-try$user`? — r2evans, Feb 16 '19 at 17:23
No problem, often questions have seemingly pointless code like that because of some code-reduction that reduced components that caused it be different and to make sense. — r2evans, Feb 17 '19 at 07:02

score 2 · Answer 1 · answered Feb 16 '19 at 17:47

This mimics (but is not legal) JSON. One method (if assumptions are true) would be to "convert to JSON" and parse from there.

Up front: I have an issue on my R session where the arabic letters are not stored correctly in the string. This happens before the calls to gsub, etc, so I believe that it might work just fine on your machine. You will see empty strings in the output as a consequence of this. (Vice hunting this down locally, I wanted to give you a shot at the code first.)

Assumptions:

there are no embedded double-quotes
the True/False/None literal strings are never embedded within text as other than the logical values (e.g., 'screen_name':'Is None' does not happen)
you are willing to find NULL within your data, where there previously was :None

Important conversions:

True and False to lowercase
None to null
encapsulate the entire thing in a dictionary with surrounding { and }
convert all single-quotes to double-quotes

One can use the magrittr pipe here for readability, or you can just nest all of the functions (barely faster):

out <- jsonlite::fromJSON(
  paste0("{", gsub(":\\s*True\\s*(,?)", ":true\\1",
                   gsub(":\\s*False\\s*(,?)", ":false\\1",
                        gsub(":\\s*None\\s*(,?)", ":null\\1",
                             gsub("'", '"', s)))),
         "}"))
# or
library(magrittr)
out <- s %>%
  gsub(":\\s*True\\s*(,?)", ":true\\1", .) %>%
  gsub(":\\s*False\\s*(,?)", ":false\\1", .) %>%
  gsub(":\\s*None\\s*(,?)", ":null\\1", .) %>%
  gsub("'", '"', .) %>%
  paste0("{", ., "}") %>%
  jsonlite::fromJSON(.)

The results (compacted with str):

str(out)
# List of 39
#  $ id                                : num 1.05e+18
#  $ id_str                            : chr "1050442590754103297"
#  $ name                              : chr "          "
#  $ screen_name                       : chr "uclkGkQ5"
#  $ location                          : NULL
#  $ url                               : NULL
#  $ description                       : chr "<U+200F><U+200F>                        "
#  $ translator_type                   : chr "none"
#  $ protected                         : logi FALSE
#  $ verified                          : logi FALSE
#  $ followers_count                   : int 1567
#  $ friends_count                     : int 4019
#  $ listed_count                      : int 0
#  $ favourites_count                  : int 6669
#  $ statuses_count                    : int 9279
#  $ created_at                        : chr "Thu Oct 11 17:46:44 +0000 2018"
#  $ utc_offset                        : NULL
#  $ time_zone                         : NULL
#  $ geo_enabled                       : logi FALSE
#  $ lang                              : chr "ar"
#  $ contributors_enabled              : logi FALSE
#  $ is_translator                     : logi FALSE
#  $ profile_background_color          : chr "F5F8FA"
#  $ profile_background_image_url      : chr ""
#  $ profile_background_image_url_https: chr ""
#  $ profile_background_tile           : logi FALSE
#  $ profile_link_color                : chr "1DA1F2"
#  $ profile_sidebar_border_color      : chr "C0DEED"
#  $ profile_sidebar_fill_color        : chr "DDEEF6"
#  $ profile_text_color                : chr "333333"
#  $ profile_use_background_image      : logi TRUE
#  $ profile_image_url                 : chr "http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
#  $ profile_image_url_https           : chr "https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
#  $ profile_banner_url                : chr "https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015"
#  $ default_profile                   : logi TRUE
#  $ default_profile_image             : logi FALSE
#  $ following                         : NULL
#  $ follow_request_sent               : NULL
#  $ notifications                     : NULL

Notes:

reminder: empty spaces here are a problem with my local "locale" setting, not necessarily the process here (sorry, I don't often process non-ASCII on my console)
the large-integer id field is converted to numeric and cannot be an integer without use of the bit64 package, I don't know if this will be a problem for you
I tried to be "careful" with replacements, and be very specific with the regex patterns, looking for a preceding colon, no ticks, and an optional follow-on comma (because the last None is not followed by anything); being as specific as possible mitigates the possibility of incorrectly replacing these literals within a string
I stress again the assumption of only-single-quotes ... if double-quotes are present, they will mess up the whole thing, and I haven't put thought into dealing with them

thank you yes it was json file but I convert it to csv in python then I open it in R — Fatima, Feb 17 '19 at 06:54
So are you saying that you have an original json file laying around? Read that instead, hands-down that is the better approach. — r2evans, Feb 17 '19 at 07:01
thank you I tried it but got this error `error Error: parse error: invalid object key (must be a string) {{"id": 1050442590754103297, "id (right here) ------^` — Fatima, Feb 17 '19 at 07:02
It's going to be really difficult to troubleshoot `jsonlite` parsing errors from comments, but it looks like you have a double-brace at the beginning. If you got this from a real json file, you probably do not need to add leading `{` and trailing `}` as I did in this answer. — r2evans, Feb 17 '19 at 07:04
yes I have it but it is in python and I am not familiar with python, so I just convert the file to csv in python then read it in R, and some columns are good but some of them like what I posted — Fatima, Feb 17 '19 at 07:08
I don't understand. If you have a JSON file, read it directly into R (`jsonlite::fromJSON("somefile.json")` works). If the python code is from elsewhere and it is downloading or generate the JSON file, then just save the non-CSV-converted json text to a file (https://stackoverflow.com/q/5214578/3358272). — r2evans, Feb 17 '19 at 07:11

Split row string into multiple columns in R

1 Answers1