0

I used R and I have this string as a row and I need to split it to be as columns

'id': 1050442590754103297, 'id_str': '1050442590754103297', 'name': 'ام رودينا ', 'screen_name': 'uclkGkQ5', 'location': None, 'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات الخاصه', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1567, 'friends_count': 4019, 'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279, 'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ar', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None

I tried this code is worked but I need to specify the number of columns that I need and also I need to rename the columns at the end, so it is difficult and takes time

d<-str_split_fixed(try$user, ",", 4)

the result I got is and it is without columns name :

'id': 1050442590754103297    'id_str': '1050442590754103297'   'name': 'ام رودينا <U+267F>'

and the fourth column has the rest of the string

'screen_name': 'uclkGkQ5', 'location': None, 'url': None, 'description': '\u200f\u200fمن زوي الاحتياجات الخاصه<U+267F>', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1567, 'friends_count': 4019, 'listed_count': 0, 'favourites_count': 6669, 'statuses_count': 9279, 'created_at': 'Thu Oct 11 17:46:44 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ar', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None

I need a code that split the rows based on comma and makes the column name the word before (:) as shown here:

 id                         id_str                    name        screen_name     
 1050442590754103297      1050442590754103297       ام رودينا \u267f           uclkGkQ5

and the same for the rest of string hope you understand me and thank you

zx8754
  • 52,746
  • 12
  • 114
  • 209
Fatima
  • 497
  • 5
  • 21

1 Answers1

2

This mimics (but is not legal) JSON. One method (if assumptions are true) would be to "convert to JSON" and parse from there.

Up front: I have an issue on my R session where the arabic letters are not stored correctly in the string. This happens before the calls to gsub, etc, so I believe that it might work just fine on your machine. You will see empty strings in the output as a consequence of this. (Vice hunting this down locally, I wanted to give you a shot at the code first.)

Assumptions:

  • there are no embedded double-quotes
  • the True/False/None literal strings are never embedded within text as other than the logical values (e.g., 'screen_name':'Is None' does not happen)
  • you are willing to find NULL within your data, where there previously was :None

Important conversions:

  • True and False to lowercase
  • None to null
  • encapsulate the entire thing in a dictionary with surrounding { and }
  • convert all single-quotes to double-quotes

One can use the magrittr pipe here for readability, or you can just nest all of the functions (barely faster):

out <- jsonlite::fromJSON(
  paste0("{", gsub(":\\s*True\\s*(,?)", ":true\\1",
                   gsub(":\\s*False\\s*(,?)", ":false\\1",
                        gsub(":\\s*None\\s*(,?)", ":null\\1",
                             gsub("'", '"', s)))),
         "}"))
# or
library(magrittr)
out <- s %>%
  gsub(":\\s*True\\s*(,?)", ":true\\1", .) %>%
  gsub(":\\s*False\\s*(,?)", ":false\\1", .) %>%
  gsub(":\\s*None\\s*(,?)", ":null\\1", .) %>%
  gsub("'", '"', .) %>%
  paste0("{", ., "}") %>%
  jsonlite::fromJSON(.)

The results (compacted with str):

str(out)
# List of 39
#  $ id                                : num 1.05e+18
#  $ id_str                            : chr "1050442590754103297"
#  $ name                              : chr "          "
#  $ screen_name                       : chr "uclkGkQ5"
#  $ location                          : NULL
#  $ url                               : NULL
#  $ description                       : chr "<U+200F><U+200F>                        "
#  $ translator_type                   : chr "none"
#  $ protected                         : logi FALSE
#  $ verified                          : logi FALSE
#  $ followers_count                   : int 1567
#  $ friends_count                     : int 4019
#  $ listed_count                      : int 0
#  $ favourites_count                  : int 6669
#  $ statuses_count                    : int 9279
#  $ created_at                        : chr "Thu Oct 11 17:46:44 +0000 2018"
#  $ utc_offset                        : NULL
#  $ time_zone                         : NULL
#  $ geo_enabled                       : logi FALSE
#  $ lang                              : chr "ar"
#  $ contributors_enabled              : logi FALSE
#  $ is_translator                     : logi FALSE
#  $ profile_background_color          : chr "F5F8FA"
#  $ profile_background_image_url      : chr ""
#  $ profile_background_image_url_https: chr ""
#  $ profile_background_tile           : logi FALSE
#  $ profile_link_color                : chr "1DA1F2"
#  $ profile_sidebar_border_color      : chr "C0DEED"
#  $ profile_sidebar_fill_color        : chr "DDEEF6"
#  $ profile_text_color                : chr "333333"
#  $ profile_use_background_image      : logi TRUE
#  $ profile_image_url                 : chr "http://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
#  $ profile_image_url_https           : chr "https://pbs.twimg.com/profile_images/1059769079790268416/sJpep_V8_normal.jpg"
#  $ profile_banner_url                : chr "https://pbs.twimg.com/profile_banners/1050442590754103297/1539390015"
#  $ default_profile                   : logi TRUE
#  $ default_profile_image             : logi FALSE
#  $ following                         : NULL
#  $ follow_request_sent               : NULL
#  $ notifications                     : NULL

Notes:

  • reminder: empty spaces here are a problem with my local "locale" setting, not necessarily the process here (sorry, I don't often process non-ASCII on my console)
  • the large-integer id field is converted to numeric and cannot be an integer without use of the bit64 package, I don't know if this will be a problem for you
  • I tried to be "careful" with replacements, and be very specific with the regex patterns, looking for a preceding colon, no ticks, and an optional follow-on comma (because the last None is not followed by anything); being as specific as possible mitigates the possibility of incorrectly replacing these literals within a string
  • I stress again the assumption of only-single-quotes ... if double-quotes are present, they will mess up the whole thing, and I haven't put thought into dealing with them
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • thank you yes it was json file but I convert it to csv in python then I open it in R – Fatima Feb 17 '19 at 06:54
  • 1
    So are you saying that you have an original json file laying around? Read that instead, hands-down that is the better approach. – r2evans Feb 17 '19 at 07:01
  • thank you I tried it but got this error `error Error: parse error: invalid object key (must be a string) {{"id": 1050442590754103297, "id (right here) ------^` – Fatima Feb 17 '19 at 07:02
  • 1
    It's going to be really difficult to troubleshoot `jsonlite` parsing errors from comments, but it looks like you have a double-brace at the beginning. If you got this from a real json file, you probably do not need to add leading `{` and trailing `}` as I did in this answer. – r2evans Feb 17 '19 at 07:04
  • yes I have it but it is in python and I am not familiar with python, so I just convert the file to csv in python then read it in R, and some columns are good but some of them like what I posted – Fatima Feb 17 '19 at 07:08
  • 1
    I don't understand. If you have a JSON file, read it directly into R (`jsonlite::fromJSON("somefile.json")` works). If the python code is from elsewhere and it is downloading or generate the JSON file, then just save the non-CSV-converted json text to a file (https://stackoverflow.com/q/5214578/3358272). – r2evans Feb 17 '19 at 07:11