
My dataframe looks like this, and I want to split it on the colon (:):

+------------------+
|Name:Roll_no:Class|
+------------------+
|      #ab:cd#:23:C|
|      #sd:ps#:34:A|
|      #ra:kh#:14:H|
|      #ku:pa#:36:S|
|      #ra:sh#:50:P|
+------------------+

and I want my dataframe to look like this:

+-----+-------+-----+
| Name|Roll_no|Class|
+-----+-------+-----+
|ab:cd|     23|    C|
|sd:ps|     34|    A|
|ra:kh|     14|    H|
|ku:pa|     36|    S|
|ra:sh|     50|    P|
+-----+-------+-----+
– Cool Triks

3 Answers


If you need to split only on the last 2 colons, use Series.str.rsplit with n=2, then set the columns by splitting the original column name, and finally remove the first and last # by indexing:

col = 'Name:Roll_no:Class'
# split on the last 2 colons only
df1 = df[col].str.rsplit(':', n=2, expand=True)
# the new column names come from splitting the original column name
df1.columns = col.split(':')
# drop the first and last characters (the #s) by indexing
df1['Name'] = df1['Name'].str[1:-1]
# or, to strip only leading/trailing # characters:
# df1['Name'] = df1['Name'].str.strip('#')
print(df1)
    Name Roll_no Class
0  ab:cd      23     C
1  sd:ps      34     A
2  ra:kh      14     H
3  ku:pa      36     S
4  ra:sh      50     P
– jezrael

Use read_csv() with sep=':' and quotechar='#':

import io
import pandas as pd

# avoid shadowing the built-in str; note there must be no trailing
# spaces after each line, or the last field would include them
data = """Name:Roll_no:Class
#ab:cd#:23:C
#sd:ps#:34:A
#ra:kh#:14:H
#ku:pa#:36:S
#ra:sh#:50:P"""

df = pd.read_csv(io.StringIO(data), sep=':', quotechar='#')
>>> df
    Name  Roll_no Class
0  ab:cd       23     C
1  sd:ps       34     A
2  ra:kh       14     H
3  ku:pa       36     S
4  ra:sh       50     P
– jxc
  • @CoolTriks for pyspark, you can do something similar: `df = spark.read.csv('/file/path', quote='#', sep=':', header=True)`. Or, for an existing column, call functions.split() twice (first on '#' and then on ':') to retrieve the needed fields. – jxc Jun 07 '19 at 19:30
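
For reference, a minimal sketch of that two-split approach from the comment (an untested illustration, assuming df is the single-column DataFrame shown in the question):

from pyspark.sql.functions import split

# assumption: df has the one raw column 'Name:Roll_no:Class'
c = df.columns[0]
new_columns = c.split(":")

parts = split(df[c], "#")    # e.g. ['', 'ab:cd', ':23:C']
rest = split(parts[2], ":")  # e.g. ['', '23', 'C']

df.select(
    parts[1].alias(new_columns[0]),  # Name
    rest[1].alias(new_columns[1]),   # Roll_no
    rest[2].alias(new_columns[2]),   # Class
).show()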

Here are two ways you could do this in pyspark:

Specify the separator and the quote on read

If you're reading the data from a file, you can use spark.read.csv with the following arguments:

df = spark.read.csv("path/to/file", sep=":", quote="#", header=True)
df.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd|     23|    C|
#|sd:ps|     34|    A|
#|ra:kh|     14|    H|
#|ku:pa|     36|    S|
#|ra:sh|     50|    P|
#+-----+-------+-----+

Use Regular Expressions

If you're unable to change the way the data is read and you're starting with the DataFrame shown in the question, you can use regular expressions to get the desired output.

First, get the new column names by splitting the existing column name on ":":

new_columns = df.columns[0].split(":")
print(new_columns)
#['Name', 'Roll_no', 'Class']

For the Name column, you need to extract the data between the #s. For the other two columns, you need to remove the string between the #s (and the following ":"), then use pyspark.sql.functions.split to extract the components:

from pyspark.sql.functions import regexp_extract, regexp_replace, split

df.withColumn(new_columns[0], regexp_extract(df.columns[0], r"(?<=#).+(?=#)", 0))\
    .withColumn(new_columns[1], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[0])\
    .withColumn(new_columns[2], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[1])\
    .select(*new_columns)\
    .show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd|     23|    C|
#|sd:ps|     34|    A|
#|ra:kh|     14|    H|
#|ku:pa|     36|    S|
#|ra:sh|     50|    P|
#+-----+-------+-----+
– pault