At the moment my code looks like the following. It's pretty simple: it reads in a data file, extracts the interesting bits, and prints them. The trouble is that it prints them in the wrong order.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class text_processing
{
    @SuppressWarnings("resource")
    public static void main(String[] args) throws IOException
    {
        String text; 
        BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/1_February/brute_force/items.csv"));

        while ((text = br.readLine()) != null) 
        {
            //the main character
            Pattern pat_0 = Pattern.compile( "『(.*?)』" );
            Matcher mat_0 = pat_0.matcher( text );
            if( mat_0.find() )
            {
                System.out.println( mat_0.group(1) );
            }
            //the pin yin
            Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>(?:(?!<script>).)*");
            Matcher mat_1 = pat_1.matcher( text );
            if( mat_1.find() )
            {
                System.out.println( mat_1.group(1) );
            }
            //the ubiquitous radical 
            Pattern pat_2 = Pattern.compile( "<span class=\"\"b\"\">部首:</span>" ); 
            Matcher mat_2 = pat_2.matcher( text );
            if( mat_2.find() )
            {
                Pattern pat_3 = Pattern.compile("<span class=\"\"b\"\">部首:</span>(.*?)<span class=\"\"b\"\">");
                Matcher mat_3 = pat_3.matcher( text );
                if( mat_3.find() )
                {
                    System.out.println("部首:" + mat_3.group(1) );

                    //stroke count (needs a successful mat_3 match, otherwise group(1) throws)
                    Pattern pat_4 = Pattern.compile(mat_3.group(1) + "<span class=\"\"b\"\">部首笔画:</span>(.*?)<span class=\"\"b\"\">");
                    Matcher mat_4 = pat_4.matcher( text );
                    if( mat_4.find() )
                    {
                        System.out.println("笔画:" + mat_4.group(1) );
                    }
                }

            }
            else
            {
                //simple rad
                Pattern pat_5 = Pattern.compile("简体部首:</span>(.*?)<span class=\"\"b\"\">");
                Matcher mat_5 = pat_5.matcher( text );
                if( mat_5.find() )
                {
                    System.out.println("简体部首:" + mat_5.group(1) );

                    //stroke count
                    Pattern pat_6 = Pattern.compile(mat_5.group(1) + "<span class=\"\"b\"\">部首笔画:</span>(.*?)<span class=\"\"b\"\">");
                    Matcher mat_6 = pat_6.matcher( text );
                    if( mat_6.find() )
                    {
                        System.out.println("简体笔画:" + mat_6.group(1) );
                    }
                }

                //trad rad
                Pattern pat_7 = Pattern.compile("繁体部首:</span>(.*?)<span class=\"\"b\"\">");
                Matcher mat_7 = pat_7.matcher( text );
                if( mat_7.find() )
                {
                    System.out.println("繁体部首:" + mat_7.group(1) );

                    //stroke count
                    Pattern pat_8 = Pattern.compile(mat_7.group(1) + "<span class=\"\"b\"\">部首笔画:</span>(.*?)<span class=\"\"b\"\">");
                    Matcher mat_8 = pat_8.matcher( text );
                    if( mat_8.find() )
                    {
                        System.out.println("繁体笔画:" + mat_8.group(1) );
                    }
                }
            }

            //the decomposition
            Pattern pat_9 = Pattern.compile("#################,\" ]:(.*?)\\(");
            Matcher mat_9 = pat_9.matcher( text );
            if( mat_9.find() )
            {
                System.out.println("首尾分解: " + mat_9.group(1) );
            }
        }
    }
}

I don't have control over how the data is structured.

Perhaps there's some kind of LinkedList I can populate on each iteration, putting the items in the correct order, and then print at the end. Does that make sense? If so, good, but I have no idea how to implement something like that. If not, what would work better?

The output currently looks like this:

首尾分解: 占乂
卥
xī
简体部首:丨 
简体笔画:1 
繁体部首:卜 
繁体笔画:2 
首尾分解: 巛乙
巤
liè
部首:巛 
笔画:3 
首尾分解: 工页
项
xiàng
简体部首:页 
简体笔画:6 
繁体部首:頁 
繁体笔画:9 

How I want it to look is:

卥
xī
首尾分解: 占乂
简体部首:丨 
简体笔画:1 
繁体部首:卜 
繁体笔画:2 

巤
liè
首尾分解: 巛乙
部首:巛 
笔画:3 

项
xiàng
首尾分解: 工页
简体部首:页 
简体笔画:6 
繁体部首:頁 
繁体笔画:9

How the data looks:

#######################," ]:占乂(zhancha)
","<table width=""620"" border=""0"" cellpadding=""0"" cellspacing=""0"">
<tr bgcolor=""#FFFFFF"">
<td width=""100""><div id=""zibg""><p class=""U5365""></p></div></td>
<td width=""510"" style=""padding-left:10px"">
<p class=""text15"">
『卥』 <br>
<span class=""b"">拼音:</span><span class=""pinyin"">xī<script>Setduyin('Duyin/xi1')</script></span> <span class=""b"">注音:</span><span class=""pinyin"">ㄒㄧ<script>Setduyin('Duyin/xi1')</script></span><br>
<span class=""b"">简体部首:</span>丨 <span class=""b"">部首笔画:</span>1 <span class=""b"">总笔画:</span>8<br><span class=""b"">繁体部首:</span>卜 <span class=""b"">部首笔画:</span>2 <span class=""b"">总笔画:</span>8<br><span class=""b"">康熙字典笔画</span>( 卥:8; )
</p></td>
</tr>
</table>"
#######################," ]:巛乙(chuanyi)
","<table width=""620"" border=""0"" cellpadding=""0"" cellspacing=""0"">
<tr bgcolor=""#FFFFFF"">
<td width=""100""><div id=""zibg""><p class=""U5DE4""></p></div></td>
<td width=""510"" style=""padding-left:10px"">
<p class=""text15"">
『巤』 <br>
<span class=""b"">拼音:</span><span class=""pinyin"">liè<script>Setduyin('Duyin/lie4')</script></span> <span class=""b"">注音:</span><span class=""pinyin"">ㄌㄧㄝˋ<script>Setduyin('Duyin/lie4')</script></span><br>
<span class=""b"">部首:</span>巛 <span class=""b"">部首笔画:</span>3 <span class=""b"">总笔画:</span>15<br><span class=""b"">康熙字典笔画</span>( 巤:15; )
</p></td>
</tr>
</table>"
  • Regex with markup. The nightmare. Use a parser like [JSoup](http://jsoup.org/) and forget you ever coded that. As a warning, have a look at [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) post. – Mena Feb 12 '15 at 10:18
  • haha what?! no way man I worked on that all day! what the hell is jsoup? –  Feb 12 '15 at 10:19
  • but anyway, also, would that even fix the problem? parsing it is not the issue, printing it is. it parses ok. –  Feb 12 '15 at 10:20
  • I just linked it in my previous comment. You might have worked on that all day, but believe me, you'll spend much more than that to get your end result, and go nuts every time there's a change in the data structure. Trust me on this one. Don't use regex to parse markup. – Mena Feb 12 '15 at 10:21
  • hmm. ok cool. thank you for that insight, certainly will help me to become a better programmer. but... i'm so damn close, there's no way i can forget about this thing before i make it work at least once, you know what i mean? –  Feb 12 '15 at 10:23
  • The issue here is probably what you are expecting your output to be. I am finding it difficult to see how pattern matching (i.e. `regex`ing) is going to solve this problem. For example, you are reading the first line (based on matching) and pushing it to a specific location. Do you happen to have a set of patterns that you will always have to match? – ha9u63a7 Feb 12 '15 at 10:24
  • can't i just put those extracted things to a linked list or something and determine the order? seems like i should be able to do something like that right? –  Feb 12 '15 at 10:26
  • Yeah I get your point, but it might be a while for anyone dedicated enough to dig into that code and help you out, if ever. In a general way, I would suggest starting over with a dedicated parser and build your own objects with the various properties, so it'll be trivial later on to decide how to print them. – Mena Feb 12 '15 at 10:26
  • @Mena I think you are trying to say "Too many Mandarin characters for us to understand :p" – ha9u63a7 Feb 12 '15 at 10:28
  • but those characters could be anything, data is data –  Feb 12 '15 at 10:29
  • @ha9u63ar that is part of the problem but once you go past the Mandarin characters and look at the code, there be the real dragons. – Mena Feb 12 '15 at 10:29
  • It's amazing that you and [this user](http://stackoverflow.com/q/28471220/4125191) seem to be tackling the same problem. Maybe you should put your heads together to solve it. – RealSkeptic Feb 12 '15 at 10:29
  • @RealSkeptic lovely! And it already has an answer, albeit debatable. – Mena Feb 12 '15 at 10:31
  • why is that amazing, Yamada is my partner, we've been putting our heads together –  Feb 12 '15 at 10:31
  • This is a question for CodeReview. – barq Feb 12 '15 at 10:34
  • Then maybe we can convince the both of you to quit using regex for parsing HTML before you run into, say, a commented out portion of the HTML and the Elder Gods become angry? You want to solve a problem, solve it *properly*. – RealSkeptic Feb 12 '15 at 10:34
  • oOo that's why people say don't parse html with regex?! that makes sense I guess. –  Feb 12 '15 at 10:38
  • @barq You're wrong about this being a question for codereview.se. When the code does not do what the asker wants the code to do, we call the code broken and will close the question. This question could be on-topic once the code was fixed. – Pimgd Feb 12 '15 at 11:02
  • You are right, OP confuses refactoring with functionality. I thought it was a refactoring question. – barq Feb 12 '15 at 11:25

1 Answer

It seems obvious why you don't get the expected order: you're reading the file line by line, so the line that matches pat_0 only comes up several iterations after the first line of the record (the decomposition line) has already been processed and printed.

You should probably create a utility object that rearranges the lines after parsing. The difficulty is finding a group identifier by which to sort the lines of each record. I can't read the script, so I can't help with that part.
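As a rough sketch of such a utility (not run against your real file): the class `RecordBuffer` and its `Field` enum are names I made up, and I'm assuming the print order you listed under "How I want it to look". It buffers whatever was extracted for one record and hands it back in a fixed order, no matter in which order the lines were found.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: collects the fields of one record and emits them
// in a fixed order, regardless of the order in which they were found.
class RecordBuffer
{
    // The declaration order of these constants is the print order (assumed).
    enum Field { CHARACTER, PINYIN, DECOMPOSITION, RADICAL, STROKES,
                 SIMP_RADICAL, SIMP_STROKES, TRAD_RADICAL, TRAD_STROKES }

    // EnumMap iterates its entries in the declaration order of the keys.
    private final Map<Field, String> fields = new EnumMap<>(Field.class);

    void put(Field field, String line)
    {
        fields.put(field, line);
    }

    // Returns the buffered lines in enum order and empties the buffer.
    List<String> flush()
    {
        List<String> ordered = new ArrayList<>(fields.values());
        fields.clear();
        return ordered;
    }
}

class RecordBufferDemo
{
    public static void main(String[] args)
    {
        RecordBuffer buffer = new RecordBuffer();
        // Fields arrive in file order: the decomposition comes first.
        buffer.put(RecordBuffer.Field.DECOMPOSITION, "首尾分解: 巛乙");
        buffer.put(RecordBuffer.Field.CHARACTER, "巤");
        buffer.put(RecordBuffer.Field.PINYIN, "liè");
        buffer.put(RecordBuffer.Field.RADICAL, "部首:巛");
        buffer.put(RecordBuffer.Field.STROKES, "笔画:3");
        for (String line : buffer.flush())
        {
            System.out.println(line);
        }
    }
}
```

In your read loop you would call `put` where you now call `System.out.println`. Assuming the decomposition line starts each record, as it does in your sample data, you would call `flush` and print the result just before storing a new decomposition, and once more after the loop ends.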

Once you have a "group id", you can create a java.lang.Comparable object that uses the group id and the pattern number to fall into the right order when it is put into a sorted Set (for example a TreeSet). At the end of parsing you can print the lines out.
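A minimal sketch of that idea, under a few assumptions of mine: the class name `OutputLine` and the field names are hypothetical, the group id is simply a counter you increment each time a new record starts, and the rank is a number you assign per pattern according to the order you want printed.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical value object: one extracted line, tagged with the record it
// belongs to (groupId) and its wanted position inside that record (rank).
final class OutputLine implements Comparable<OutputLine>
{
    final int groupId;
    final int rank;
    final String text;

    OutputLine(int groupId, int rank, String text)
    {
        this.groupId = groupId;
        this.rank = rank;
        this.text = text;
    }

    // Sort by record first, then by position inside the record.
    // Note: a TreeSet treats two lines with equal groupId and rank as duplicates.
    @Override
    public int compareTo(OutputLine other)
    {
        int byGroup = Integer.compare(groupId, other.groupId);
        return byGroup != 0 ? byGroup : Integer.compare(rank, other.rank);
    }
}

class OutputLineDemo
{
    public static void main(String[] args)
    {
        Set<OutputLine> lines = new TreeSet<>();
        // Added in file order; the ranks encode the wanted print order.
        lines.add(new OutputLine(0, 2, "首尾分解: 占乂"));
        lines.add(new OutputLine(0, 0, "卥"));
        lines.add(new OutputLine(0, 1, "xī"));
        lines.add(new OutputLine(1, 2, "首尾分解: 巛乙"));
        lines.add(new OutputLine(1, 0, "巤"));
        lines.add(new OutputLine(1, 1, "liè"));
        // Iterating a TreeSet visits the elements in compareTo order.
        for (OutputLine line : lines)
        {
            System.out.println(line.text);
        }
    }
}
```

This keeps every line in memory until the end of parsing, which should be fine for a file of this kind but is worth knowing if the input is very large.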

cigno5.5