I scraped some data using BeautifulSoup and saved it as a .txt file. The data is movie reviews from IMDB.com. I found a good word-counting Python script, so I could make a word frequency table in Excel. However, I could not draw a graph using only the frequency table.

I want to draw a semantic network graph using UCINET (node size should be based on betweenness centrality).

My question is how to turn the text file into adjacency matrix data so that I can draw a UCINET graph like this: http://www.umasocialmedia.com/socialnetworks/wp-content/uploads/2012/09/senatorsxsenators1.png I want to draw the network graph using the words that the reviewers used.

(Each cell should hold how often the word for that row and the word for that column appear in the same sentence.)

Alternatively, could you tell me how to draw a network graph (using betweenness centrality) in Python?

Rejeena
  • What are you defining as 'adjacency' in this case? Your question is unclear. – JDong May 23 '15 at 17:51
  • @JDong Sorry for my poor English. Suppose a reviewer writes the sentence "X-Men is awesome", and there is a 20x20 matrix whose second row is "X-Men" and whose fourth column is "awesome". Whenever "X-Men is awesome" appears in the data, 1 is added to cell (2,4) of the matrix. – Rejeena May 23 '15 at 18:05
  • Just to confirm, your adjacency matrix will be symmetric because you have an undirected graph, correct? – JDong May 23 '15 at 18:27
  • @JDong Yes. You are correct. Symmetric matrix. – Rejeena May 23 '15 at 18:40

1 Answer

Make a 2D 20x20 array, loop through each input string, and update your matrix using that string:

import re

adjacency_matrix = [[0 for _ in range(20)] for _ in range(20)]


def get_lines(filename):
    """Return the lines in the file."""
    with open(filename, 'r') as fp:
        return fp.readlines()


def update_matrix(matrix, mapping, string):
    """Update the given adjacency matrix using the given string."""
    # Keep only the whitespace-separated tokens that have an index in the matrix.
    words = [word for word in re.split(r"\s+", string) if word in mapping]
    # Every ordered pair is counted, so the matrix stays symmetric.
    # A word is also paired with itself, which fills the diagonal.
    for word_1 in words:
        for word_2 in words:
            matrix[mapping[word_1]][mapping[word_2]] += 1


if __name__ == "__main__":
    words_in_matrix = [
        "X-men", "awesome", "good", "bad",
        # ... 16 more ...
    ]
    # Map each word to its row/column index in the adjacency matrix.
    mapping = {word: index for index, word in enumerate(words_in_matrix)}

    for line in get_lines("ibdb.txt"):
        update_matrix(adjacency_matrix, mapping, line)
    print(adjacency_matrix)

A function similar to update_matrix may be useful, where matrix is your adjacency matrix, mapping maps each word to its index in the adjacency matrix, and string is one sample review.

You will need to modify this to your needs. Inputs may have periods or other noise characters, which will need to be stripped.
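As one possible way to do that stripping, here is a minimal sketch that lowercases the text, drops punctuation around words, and counts co-occurrence per sentence rather than per line. The helper names (tokenize, split_sentences, cooccurrence_matrix), the sample text, and the four-word vocabulary are all made up for illustration, and the sentence splitter is deliberately naive. Unlike update_matrix above, it counts each word at most once per sentence and leaves the diagonal at zero, which is an assumption about what you want.

```python
import re
from itertools import combinations


def tokenize(sentence):
    """Lowercase a sentence and return its words without surrounding punctuation."""
    # Inner hyphens and apostrophes are kept, so "X-Men" stays one token.
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrence_matrix(text, vocabulary):
    """Count, for each pair of vocabulary words, the sentences containing both."""
    mapping = {word.lower(): index for index, word in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for sentence in split_sentences(text):
        # A set means each word counts once per sentence.
        present = sorted({mapping[w] for w in tokenize(sentence) if w in mapping})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return matrix


# Hypothetical sample data, not taken from the real reviews.
sample = "X-Men is awesome. The plot is good, not bad! X-Men is good."
vocab = ["X-men", "awesome", "good", "bad"]
result = cooccurrence_matrix(sample, vocab)
for word, row in zip(vocab, result):
    print(word, row)
```

To run this on a whole file, you could read it with open(filename).read() and pass the text in as one string, so that sentences spanning several lines are still handled.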

JDong
  • Oh... Thank you so much for your answer. But I am a Python beginner, and I don't know how to apply your code to my data. My data is a text file containing more than 100 reviews, each of which has more than 10 lines of text. – Rejeena May 23 '15 at 18:24
  • Teaching the basics of Python isn't in the scope of this problem, sorry. However, I will provide you with some more sample code so you can get started. – JDong May 23 '15 at 18:29
  • Wow, thanks. In your example, what is the 'X-men': 0? If there is a sentence like "X-men is awesome", isn't the word count for 'X-men' 1? – Rejeena May 23 '15 at 18:33
  • And I have another question. How do I use 'update_matrix' on the whole .txt file, not just one particular sentence like "X-men is awesome"? – Rejeena May 23 '15 at 18:36
  • Your adjacency matrix only holds adjacency data. The rows and columns are unlabeled. The labels for the rows and columns are stored in `mapping`. The new example code should be more clear, in the `__main__` section. – JDong May 23 '15 at 18:44
  • @Rejeena if this was helpful, please vote it as useful. If it answered your question, please mark your question as resolved. – JDong May 23 '15 at 20:07
  • I am very sorry for bothering you. I ran the code and got a result like this: "[10, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 84, 2, 0, 3, 3, 0, 0, 1, 2, 5, 0, 0, 0, 0, 0, 0, 1, 1, 1], ..." How do I convert this into Excel? How do I split these lists into single numbers, so that 10 0 0 0 1 0 each go into their own Excel cell? – Rejeena May 24 '15 at 05:45
  • There are lots of ways. You can look it up, for example: https://stackoverflow.com/questions/13437727/python-write-to-excel-spreadsheet. You could also convert to CSV, then find a CSV to Excel converter. – JDong May 24 '15 at 06:05
  • Yeah! Thanks. Convert to CSV and then convert to Excel! – Rejeena May 24 '15 at 06:23
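
For anyone following the CSV route from these comments, here is a minimal sketch using Python's built-in csv module. The function name write_labeled_csv, the output filename adjacency.csv, and the small 3x3 matrix are assumptions made up for illustration; in practice you would pass in the real adjacency matrix and the same word list used to build the mapping. Writing the words as a header row and a first column puts the labels next to the counts when the file is opened in Excel.

```python
import csv


def write_labeled_csv(matrix, labels, filename):
    """Write the matrix as CSV, with the words as header row and first column."""
    # newline="" lets the csv module control line endings itself.
    with open(filename, "w", newline="", encoding="utf-8") as fp:
        writer = csv.writer(fp)
        writer.writerow([""] + list(labels))  # top-left corner cell stays empty
        for label, row in zip(labels, matrix):
            writer.writerow([label] + list(row))


# Hypothetical example data: three words and their co-occurrence counts.
labels = ["X-men", "awesome", "good"]
matrix = [[10, 1, 0], [1, 84, 2], [0, 2, 5]]
write_labeled_csv(matrix, labels, "adjacency.csv")
```

Excel can open the resulting .csv file directly, with each number in its own cell.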