PolyU X RADICA BigDatathon 2018 - UBS Challenge (26-27 May 2018)

It was a great honor for us to present to David Ong, Executive Director of UBS AG, for the UBS Challenge on 27 May 2018. With only 12 hours of working time, finishing the task was difficult. The experience helped us understand our strengths and weaknesses and learn from others. To our surprise, we received the Best Presentation Award, and our classification had the highest accuracy among all the competitors!

A large amount of content is generated every day on social media channels such as Weibo in China. For marketing purposes, UBS would like to know which content netizens in different cities of China are most interested in. With the results, UBS could focus on hosting more relevant client outreach events based on those interests.

Over 6,000 Weibo posts (unstructured data), extracted from 8 key opinion leaders (KOLs) between 1 Feb and 30 Apr 2018, were provided. These posts first had to be grouped into categories of interest for easier analysis.

Challengers are required to:

A) Classify these Weibo posts into 13 categories of interest according to the category IDs below:

0. Stock
1. Bond
2. Oil
3. Gold
4. Real Estate
5. Chinese Art (painting/ drawing/ calligraphy)
6. Western Art (painting/ drawing/ calligraphy)
7. Jewelry
8. Artifacts
9. Golf
10. Car
11. Overseas Education
12. Young Children Education

B) Rank the 13 interest categories (a) within each city and (b) across the cities by popularity, based on each post's likes, retweets and comments and on the city of its commentators. The most popular interest should be listed first.

C) Present business insights that add further value to UBS's business as a marketing solution.

Analysis & Solutions:

Data Preprocessing:

  • Removal of Duplicate Data
    • The organizer provided 7,655 posts. However, some records were duplicated or had missing values. Because time was limited, low-quality records were deleted to obtain results quickly.
  • Text Preprocessing 
    • Stopword removal: words that carry little meaning, such as "but" and "and", were removed.
    • Punctuation, white space and other special characters were removed.
    • Sentences were tokenized into sequential phrases (bi-grams, tri-grams and so on). This retains the word order, which is important for interpreting the complete meaning of a sentence.
  • Data Modelling:
    • Data were split into a training dataset and a test dataset.
    • The organizer provided training labels for a sample of 10% of the original dataset.

    • Some categories had very few examples in the training dataset.

    • Data manipulation was done to get a more balanced training set. The data quality had to be checked carefully beforehand to ensure that the labels of the training sample were mostly correct.
  • Model: 1-Dimensional Convolutional Neural Network (1D CNN)
    • Well suited to sequential data
    • Word embedding: Wikipedia
  • Model Evaluation: Results (based on validation set)
    • Accuracy = ~90%
    • Loss = ~10%
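
The preprocessing steps listed above (dropping duplicates and missing values, removing stopwords and punctuation, building bi-grams and tri-grams) can be sketched roughly as follows. This is a minimal illustration, not the code we used during the challenge: the stopword list, the function names and the English sample text are assumptions made for the example. Real Weibo text is Chinese and would first need a word segmenter, which the plain whitespace split here does not provide.

```python
import re

# Hypothetical stopword list for illustration; the list used in the challenge is not given.
STOPWORDS = {"but", "and", "the", "a", "of"}


def clean_tokens(text):
    """Lowercase the text, replace punctuation/special characters with spaces, drop stopwords."""
    # \w matches Unicode word characters, so non-Latin characters are kept.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # split() with no argument also collapses repeated white space.
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens, n):
    """Return consecutive n-token phrases, preserving word order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(posts):
    """Drop missing and duplicated posts, then tokenize the rest."""
    seen = set()
    result = []
    for post in posts:
        if not post or not post.strip():
            continue  # missing value
        key = post.strip()
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        tokens = clean_tokens(post)
        result.append({
            "tokens": tokens,
            "bigrams": ngrams(tokens, 2),
            "trigrams": ngrams(tokens, 3),
        })
    return result


if __name__ == "__main__":
    sample = ["Gold and oil prices rise!", "Gold and oil prices rise!", "  ", None]
    for row in preprocess(sample):
        print(row)
```

Exact-match de-duplication is the simplest possible choice here; near-duplicates such as reposts with small edits would need a fuzzier comparison.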

Dashboard (Power BI):

Some interpretation:

  • This page gives you an overview of the words or phrases used by the given KOLs. You can use the filter to see which words the KOLs use most and which phrases usually receive more likes, comments or retweets.

  • This page is an overview of the category analysis. We proposed a weighted formula for the score: 1 mark for each like, 3 marks for each comment and 6 marks for each retweet. Thus, if a Weibo user gives a like, a comment and a retweet to a particular post, this counts as 10 marks.
  • In China, Overseas Education receives the highest score among all the categories. This suggests that it is the category Chinese Weibo users care about most.
  • When a filter is applied (in the picture, Sichuan is selected), the composition of the score is shown on the left-hand side.
  • This page allows you to compare categories, regions or cities, or any other combination of filters.
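
The scoring formula above (1 mark per like, 3 per comment, 6 per retweet) and the ranking of categories within each city and across cities can be sketched as below. The dashboard itself was built in Power BI, so this Python version is only an illustration. The record layout (one row per category and city with aggregated counts), the function names and the sample numbers are all assumptions made for the example.

```python
from collections import defaultdict

# Weights taken from the scoring formula described above.
WEIGHTS = {"likes": 1, "comments": 3, "retweets": 6}


def score(record):
    """Weighted engagement score of one record; missing counts are treated as 0."""
    return sum(weight * record.get(field, 0) for field, weight in WEIGHTS.items())


def _ranked(totals):
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def rank_overall(records):
    """Rank categories across all cities."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += score(rec)
    return _ranked(totals)


def rank_by_city(records):
    """Rank categories separately within each city."""
    per_city = defaultdict(lambda: defaultdict(int))
    for rec in records:
        per_city[rec["city"]][rec["category"]] += score(rec)
    return {city: _ranked(totals) for city, totals in per_city.items()}


if __name__ == "__main__":
    # Made-up sample rows, purely for illustration.
    records = [
        {"category": "Overseas Education", "city": "Sichuan",
         "likes": 1, "comments": 1, "retweets": 1},
        {"category": "Gold", "city": "Sichuan",
         "likes": 5, "comments": 0, "retweets": 0},
        {"category": "Gold", "city": "Beijing",
         "likes": 0, "comments": 2, "retweets": 1},
    ]
    print(rank_overall(records))
    print(rank_by_city(records))
```

Weighting retweets above comments and comments above likes reflects the idea that more effortful interactions signal stronger interest; other weights would change the ranking.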


Best Presentation

