3

It's seems possible to relate with Japanese Language problem, So I asked in Japanese StackOverflow also.

When I use string just object, it works fine.

I tried to encode but I couldn't find the reason of this error. Could you please give me advice?

MeCab is an open source text segmentation library for use with text written in the Japanese language originally developed by the Nara Institute of Science and Technology and currently maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project. https://en.wikipedia.org/wiki/MeCab

sample.csv

0,今日も夜まで働きました。
1,オフィスには誰もいませんが、エラーと格闘中
2,デバッグばかりしていますが、どうにもなりません。

This is Pandas Python3 code

import pandas as pd
import MeCab  
# https://en.wikipedia.org/wiki/MeCab
from tqdm import tqdm_notebook as tqdm
# This is working...
df = pd.read_csv('sample.csv', encoding='utf-8')

m = MeCab.Tagger ("-Ochasen")

text = "りんごを食べました、そして、みかんも食べました"
a = m.parse(text)

print(a)# working! 

# But I want to use Pandas's Series



def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords



aa = extractKeyword(text) #working!!

me = df.apply(lambda x: extractKeyword(x))

#TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

This is the trace error

りんご リンゴ りんご 名詞-一般       
を   ヲ   を   助詞-格助詞-一般       
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
、   、   、   記号-読点       
そして ソシテ そして 接続詞     
、   、   、   記号-読点       
みかん ミカン みかん 名詞-一般       
も   モ   も   助詞-係助詞      
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
EOS

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-174-81a0d5d62dc4> in <module>()
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260                         f, axis,
4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
4263             else:
4264                 return self._apply_broadcast(f, axis)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4356             try:
4357                 for i, v in enumerate(series_gen):
-> 4358                     results[i] = func(v)
4359                     keys.append(v.name)
4360             except Exception as e:

<ipython-input-174-81a0d5d62dc4> in <lambda>(x)
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

<ipython-input-174-81a0d5d62dc4> in extractKeyword(text)
    20     """Morphological analysis of text and returning a list of only nouns"""
    21     tagger = MeCab.Tagger('-Ochasen')
---> 22     node = tagger.parseToNode(text)
    23     keywords = []
    24     while node:

~/anaconda3/lib/python3.6/site-packages/MeCab.py in parseToNode(self, *args)
    280     __repr__ = _swig_repr
    281     def parse(self, *args): return _MeCab.Tagger_parse(self, *args)
--> 282     def parseToNode(self, *args): return _MeCab.Tagger_parseToNode(self, *args)
    283     def parseNBest(self, *args): return _MeCab.Tagger_parseNBest(self, *args)
    284     def parseNBestInit(self, *args): return _MeCab.Tagger_parseNBestInit(self, *args)

TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')w
Ahmed Fasih
  • 6,458
  • 7
  • 54
  • 95
YOSUKE
  • 331
  • 3
  • 13

2 Answers2

2

I see you got some help on the Japanese StackOverflow, but here's an answer in English:

The first thing to fix is that read_csv was treating the first line of your example.csv as the header. To fix that, use the names argument in read_csv.

Next, df.apply will by default apply the function on columns of the dataframe. You need to do something like df.apply(lambda x: extractKeyword(x['String']), axis=1), but this won't work because each sentence will have a different number of nouns and Pandas will complain it cannot stack a 1x2 array on top of a 1x5 array. The simplest way is to apply on the Series of String.

The final problem is, there's a bug in the MeCab Python3 bindings: see https://github.com/SamuraiT/mecab-python3/issues/3 You found a workaround by running parseToNode twice, you can also call parse before parseToNode.

Putting all these three things together:

import pandas as pd
import MeCab  
df = pd.read_csv('sample.csv', encoding='utf-8', names=['Number', 'String'])

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parse(text)
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

me = df['String'].apply(extractKeyword)
print(me)

When you run this script, with the example.csv you provide:

➜  python3 demo.py
0                  [今日, 夜]
1    [オフィス, 誰, エラー, 格闘, 中]
2                   [デバッグ]
Name: String, dtype: object
Ahmed Fasih
  • 6,458
  • 7
  • 54
  • 95
1

parseToNode fail everytime , so needed to put this code

 tagger.parseToNode('dummy') 

before

 node = tagger.parseToNode(text)   

and It's worked!

But I don't know the reason, maybe parseToNode method has bug..

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
   tagger = MeCab.Tagger('-Ochasen')
   tagger.parseToNode('ダミー') 
   node = tagger.parseToNode(text)
   keywords = []
   while node:
       if node.feature.split(",")[0] == u"名詞": # this means noun
           keywords.append(node.surface)
       node = node.next
   return keywords 
YOSUKE
  • 331
  • 3
  • 13