How to extract data from text field in pandas dataframe?

Question

I want to obtain distribution of tags from this dataframe:

df=pd.DataFrame([
    [43,{"tags":["webcom","start","temp","webcomfoto","dance"],"image":["https://image.com/Kqk.jpg"]}],
    [83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
    [76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
    [77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
    [81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])

I need to get a table with number of 'id's with specific number of tags. eg.

 Number of posts | Number of tags 
      31                9
      44                8
      ...
      129               1

I used this approach for the case when 'tags' is the only field. In this dataframe I also have 'image', 'users' and other text fields with values. How should I handle data in this case?

Thank you

score 1 · Answer 1 · answered Jun 06 '18 at 16:05

1

Sticking with collections.Counter, here's one way:

from collections import Counter
from operator import itemgetter

c = Counter(map(len, map(itemgetter('tags'), df['tags'])))

res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['Tags', 'Posts']

print(res)

   Tags  Posts
0     5      2
1     3      1
2     2      1
3     1      1

answered Jun 06 '18 at 16:05

jpp

159,742
34
281
339

Thanks, it worked on the sample data, but when I apply to the real dataframe I get TypeError: string indices must be integers. What could cause this? Thanks – user40 Jun 06 '18 at 16:15
@user40, Not sure, I'm afraid. I can have a look if you're able to edit your question with some data which replicates your issue. – jpp Jun 06 '18 at 16:32

Scott Boston · Answer 2 · 2018-06-06T16:36:35.963

1

You could use the str accessor to get to dictionary keys and len with value_counts:

df.tags.str['tags'].str.len().value_counts()\
  .rename('Posts')\
  .rename_axis('Tags')\
  .reset_index()

Output:

   Tags  Posts
0     5      2
1     3      1
2     2      1
3     1      1

edited Jun 06 '18 at 16:36

answered Jun 06 '18 at 16:14

Scott Boston

147,308
15
139
187

jezrael · Accepted Answer · 2018-06-06T16:48:21.790

1

There is problem your data in column tags are strings, no dictionaries.

So need first step:

import ast

df['tags'] = df['tags'].apply(ast.literal_eval)

and then apply original answer, working very nice if multiple fields.

Verifying:

df=pd.DataFrame([
    [43,{"tags":[],"image":["https://image.com/Kqk.jpg"]}],
    [83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
    [76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
    [77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
    [81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])
#print (df)

#convert column to string for verify solution
df['tags'] = df['tags'].astype(str)

print (df['tags'].apply(type))
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
Name: tags, dtype: object

#convert back
df['tags'] = df['tags'].apply(ast.literal_eval)

print (df['tags'].apply(type))
0    <class 'dict'>
1    <class 'dict'>
2    <class 'dict'>
3    <class 'dict'>
4    <class 'dict'>
Name: tags, dtype: object

c = Counter([len(x['tags']) for x in df['tags']])

df = pd.DataFrame({'Number of posts':list(c.values()), ' Number of tags ': list(c.keys())})
print (df)
   Number of posts   Number of tags 
0                1                 0
1                1                 3
2                1                 2
3                1                 5
4                1                 1

edited Jun 06 '18 at 16:48

answered Jun 06 '18 at 16:43

jezrael

822,522
95
1,334
1,252

I get this error: ValueError: malformed node or string: {'tags': ['webcom', 'start', 'vpered', 'webcomfotografiya', 'privet'], 'image': ['https://image.com/Kqk.jpg']} – user40 Jun 06 '18 at 16:54
If use `df['tags'].apply(type)` it return `dict` in your original data? What is source of data? – jezrael Jun 06 '18 at 16:56
print (df['tags'].apply(type)) gives 0 ... Name: tags, dtype: object – user40 Jun 06 '18 at 16:58
What is source of data? `json`? `csv` ? – jezrael Jun 06 '18 at 16:59
it worked on the sample data thanks. checking with the original data – user40 Jun 06 '18 at 17:00
source is mongodb, so json – user40 Jun 06 '18 at 17:00
@user40 - So problem is how convert `string`s to `dictionaries`, is possible use [this solution](https://stackoverflow.com/a/38610160/2901002) ? – jezrael Jun 06 '18 at 17:09
looks like it. what should be considered as 'data' in df = json_normalize(data) ? – user40 Jun 06 '18 at 17:20
`data` are json strings, from mongodb. – jezrael Jun 06 '18 at 17:21
so I use this df['metadata'] = json_normalize(df['metadata']), metadata is the actual column name in the original dataframe. but I get this error AttributeError: 'str' object has no attribute 'values' – user40 Jun 06 '18 at 17:23
How create `DataFrame` from mongodb? – jezrael Jun 06 '18 at 17:24
Here is problem you need convert string to dictionaries before creating `DataFrame`, so `json_normalize(df['metadata'])` cannot work. – jezrael Jun 06 '18 at 17:30
I find [this](https://stackoverflow.com/a/41240556/2901002) last paragrpah is about `json_normalize` – jezrael Jun 06 '18 at 17:34
so I obtained a dataframe with tags only, added the first step as you suggested in this answer: df['tags'] = df['tags'].apply(ast.literal_eval) and it takes a while to count but then I get error ValueError: malformed node or string: nan – user40 Jun 07 '18 at 12:56
@user40 So it means there is at least one value in data is problematic. First step is find problem. I suggest use [this answer](https://stackoverflow.com/a/49478085), wrap sofisticated solution to function `def get_error(x): try: ast.literal_eval(x) except ValueError as ex: _exc_type, exc_valu......` and what tags cannot be processes. – jezrael Jun 07 '18 at 14:25
@user40 another solution if data are not confidental is send it to my email in my profile and I try find problem. – jezrael Jun 07 '18 at 14:27
the original data from the database was extremely 'dirty', after cleaning the data everything worked just fine! Thank you for your time. – user40 Jun 08 '18 at 19:07

Haleemur Ali · Answer 4 · 2018-06-06T16:32:06.893

Update: Use a combination of f-strings, dictionary comprehension & list comprehension to extract the length of all lists in your tags column concisely:

extract_dict = [{f'count {y}':len(z) for y,z in x.items()} for x in df.tags]

# construct new df with only extracted counts
pd.DataFrame.from_records(extract_dict)

# new df with extracted counts & original data
df.assign(**pd.DataFrame.from_records(extract_dict))

# outputs:

   _id                                               tags  count image  \
0   43  {'tags': ['webcom', 'start', 'temp', 'webcomfo...          1.0
1   83  {'tags': ['yourself', 'start', ''], 'image': [...          1.0
2   76  {'tags': ['en', 'webcom'], 'links': ['http://w...          NaN
3   77  {'tags': ['webcomznakomstvo', 'webcomzhiznx', ...          2.0
4   81  {'tags': ['webcomfotografiya'], 'users': ['mys...          NaN

   count links  count tags  count users
0          NaN           5          NaN
1          NaN           3          NaN
2          2.0           2          1.0
3          NaN           5          NaN
4          1.0           1          2.0

original answer:

If you know the column names in advance, it is possible to use list comprehension for this task

 extract = [(len(x.get('tags',[])), len(x.get('images',[])), len(x.get('users',[])))     
  for x in df.tags]
 # extract outputs:
 [(5, 0, 0), (3, 0, 0), (2, 0, 1), (5, 0, 0), (1, 0, 2)]

which can then be used create a new dataframe or to assign additional columns

# creates new df
pd.DataFrame.from_records(
  extract, 
  columns=['count tags', 'count images', 'count users']
)

# creates new dataframe with extracted data and original df
df.assign(
    **pd.DataFrame.from_records(
        extract, 
        columns=['count tags', 'count images', 'count users'])
)

The last statement produced the following output:

   _id                                               tags  count tags  \
0   43  {'tags': ['webcom', 'start', 'temp', 'webcomfo...           5
1   83  {'tags': ['yourself', 'start', ''], 'image': [...           3
2   76  {'tags': ['en', 'webcom'], 'links': ['http://w...           2
3   77  {'tags': ['webcomznakomstvo', 'webcomzhiznx', ...           5
4   81  {'tags': ['webcomfotografiya'], 'users': ['mys...           1

   count images  count users
0             0            0
1             0            0
2             0            1
3             0            0
4             0            2

How to extract data from text field in pandas dataframe?

4 Answers4