When Moore's Law ENDS

I’m a chip designer working on the digital side. I’ve got experience with

CPU/SoC architecture and design, especially RISC-V open ISA
IC design/verification with Verilog/SystemVerilog/SystemC
Low power design and optimization
ASIC design flow, including front-end, back-end and power sign-off
Semi-custom design flow, including transistor timing analysis and SPICE simulation

Currently my interests are

Hardware and software co-design
SoC generator
Machine learning accelerator

If you share the same interests and want to discuss them, please send me a message on LinkedIn.

Work experience

Facebook AR/VR
- As SoC Lead
- From Feb. 2019 to Present
OURS Technology Inc.
- As Founding Engineer
- From Aug. 2017 to Feb. 2019
Marvell Semiconductor Inc.
- As Staff Engineer and Design Manager
- From Feb. 2012 to Aug. 2017
…

https://developer.arm.com/technologies/big-little big.LITTLE is a practical example of heterogeneous multiprocessing (HMP). It combines high-performance CPU cores and low-power CPU cores on the same chip, connected by a cache-coherent interconnect. This gives high peak performance within the thermal bounds of the system when intense computational power is needed, and maximum energy efficiency when the device is in light usage mode, which is most of the time. It’s an adaptation tailored to how mobile devices are used.
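The trade-off between the two kinds of cores can be sketched as a toy placement rule. This is only an illustration, not how any real scheduler works: the `Core` type, the capacity and power numbers, and `pick_core` are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    capacity: int   # relative peak throughput (made-up units)
    power_mw: int   # active power in mW (made-up number)


# Hypothetical figures, chosen only to show the shape of the trade-off.
BIG = Core("big", capacity=1024, power_mw=750)
LITTLE = Core("LITTLE", capacity=400, power_mw=100)


def pick_core(load):
    """Pick the lowest-power core type whose capacity covers the load.

    If nothing is large enough, fall back to the big core for peak performance.
    """
    candidates = [core for core in (LITTLE, BIG) if core.capacity >= load]
    if not candidates:
        return BIG
    return min(candidates, key=lambda core: core.power_mw)


for load in (100, 900, 2000):
    print(load, "->", pick_core(load).name)
```

Light loads land on the low-power core and heavy ones on the high-performance core. A real scheduler also has to account for thermal limits, task migration cost, and per-core frequency scaling, all of which this sketch ignores.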

There is a big difference between how I used to understand hardware security and the state-of-the-art security supported by hardware/software co-design, as I realized after watching a video about Apple’s SEP (Secure Enclave Processor). It’s a key component in current iPhones that protects user data and passwords from being observed through any kind of hacking, including traditional side-channel attacks such as DPA (Differential Power Analysis), debug channel attacks, ordinary network attacks, etc.

Because I was job hunting recently, I’ve had lots of appointments, either phone calls or face-to-face meetings. Two HR people left a particularly deep impression on me, and they are at totally opposite ends of professionalism. A good example: Whenever she wanted to arrange a phone call with me, she sent me an email ahead of time to confirm the time. She never called me directly without an appointment. At the exact time of the appointment, ex.

What Are My Strengths?
Feedback analysis
- The only way to discover your strengths is through feedback analysis.
- Whenever you make a key decision or take a key action, write down what you expect will happen.
- 9 or 12 months later, compare the actual results with your expectations.
Follow-up actions
Find out what your strengths and weaknesses are, using feedback analysis, so that you can
- Concentrate on your strengths.
- Put yourself where your strengths can produce results.
- Work on improving your strengths.

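Because I work at a startup, even engineers have to talk to customers directly. Last summer I had the “good fortune” to visit an “important” customer in a city in southern China on the hottest days of the year. Traveling with me was Lao Niu, at the time the only employee of our China subsidiary. As far as the visit itself went, the whole thing was fairly successful, although the follow-up joint project later failed for various reasons. But that has nothing to do with this anecdote. The anecdote is short, and the whole thing took no more than a minute, yet it left a deep impression on me.

Lao Niu and I flew in from Beijing and arrived in the morning, then took a taxi to the customer’s office. The customer’s CEO, as a courtesy to a fellow alumnus, had invited us to lunch. The taxi stopped and I got out first, since Lao Niu was in charge of paying the fare. Seeing the CEO come out in person to greet us, I naturally and warmly stepped forward to shake his hand. But he did not look at me at all; he walked right past me and shook hands with Lao Niu. My outstretched hand was left hanging in mid-air, and I must have looked very awkward. Only later did I learn that the CEO had assumed that, since Lao Niu was not a technical person, he must be my boss, so he did not mind leaving me hanging. Never mind that Lao Niu and I had no reporting relationship to begin with: a startup is flat by nature, and besides, I was with the US company and he was with the China company, so our org charts did not even connect. That is the whole anecdote.

Nearly half a year has passed, and I suddenly thought of it today because of a Zhihu article I read. It was about how elderly German engineers “earn their money standing up”, that is, without giving up their dignity, and it offered a glimpse of the whole picture through many details and small incidents from a joint project. Recalling my own incident, I could relate. It also made me sigh: this is how the boss behaves even at a high-tech company founded by engineers, driven by technology, and staffed mostly by engineers. It says a lot about the standing of engineers in mainland China. Because the financial reward and social status do not match, an engineer with even a little experience has to move into management to keep his dignity. And once he is a manager, more of his time goes to non-technical matters, so he naturally has little energy left to dig into technology, and may even struggle to keep up with it. Hence all the complaints about “laymen managing experts”. If this goes on, how can there be any accumulation of technology?

If the country really wants a strategic shift to high-tech industry, respect for engineers has to be put into practice. But while economic measures are easy to push through from the top down, how do you correct prejudices that people have held for many years? The only way is to “overcorrect”: push the economic measures to the extreme, so that engineers’ incomes broadly reach the level of the high-income class. Given time, the rise in social status will follow naturally. Only then can more capable engineers be kept working on technology, accumulating it, and training newcomers.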

Intro
- Device tree: for non-discoverable hardware, included in the BSP
Source type
- Old style: C code BSP, files compiled into the kernel
- New style: device-tree BSP -> device tree blob (loaded by the boot loader)
Compilation
- In-tree vs. out-of-tree
- The dtc command converts .dts to .dtb, and back
Device tree syntax
- devicetree.org
- Nodes
- Properties
- Values
- Root node = /
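To make the node/property/value vocabulary concrete, here is a minimal Python sketch that models a tiny tree and prints it in .dts source syntax. It is an illustration under stated assumptions, not part of the course material: the `Node` class and the `render`/`to_dts` helpers are made up for this example, and the board name `acme,example-board` and the UART address are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One device tree node: a name, its properties, and child nodes."""
    name: str
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def format_value(value):
    """Render a property value in .dts syntax (None means a valueless flag)."""
    if value is True:
        return None                  # boolean property: its presence is the value
    if isinstance(value, str):
        return f'"{value}"'          # string value
    if isinstance(value, (list, tuple)):
        # list of 32-bit cells, written in hex between angle brackets
        return "<" + " ".join(f"0x{cell:x}" for cell in value) + ">"
    raise TypeError(f"unsupported property value: {value!r}")


def render(node, depth=0):
    """Return the .dts source lines for a node and everything under it."""
    pad = "    " * depth
    lines = [f"{pad}{node.name} {{"]
    for key, value in node.properties.items():
        text = format_value(value)
        lines.append(f"{pad}    {key};" if text is None else f"{pad}    {key} = {text};")
    for child in node.children:
        lines.extend(render(child, depth + 1))
    lines.append(f"{pad}}};")
    return lines


def to_dts(root):
    """Prefix the rendered tree with the /dts-v1/ version tag."""
    return "\n".join(["/dts-v1/;", ""] + render(root)) + "\n"


# Hypothetical board: a root node "/" with one memory-mapped UART child.
uart = Node("uart@10000000", {"compatible": "ns16550a", "reg": [0x10000000, 0x100]})
root = Node("/", {"#address-cells": [1], "#size-cells": [1],
                  "compatible": "acme,example-board"}, [uart])
dts_text = to_dts(root)
print(dts_text)
```

Saved to a .dts file, text of this shape is what `dtc -I dts -O dtb` compiles into a blob, and `dtc -I dtb -O dts` goes the other way.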

Trying to find the perfect static site generator. I used to use Pelican, because it’s written in Python. I also tried Jekyll, the most popular candidate, because it’s used by GitHub. Their common problems are
- Not intuitive enough. They seem like something programmers created for programmers.
- There are very restrictive requirements on the directory structure of the content, and they don’t fit my understanding/requirements.
Now I’m trying Hugo, which is written in Go, and it’s really fast.

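After handing in my notice, I did feel a bit down. The emptiness is mostly a sense of not belonging. I no longer see myself as a member of the old organization, but I have not yet truly joined the new one. Being stuck between the two leaves me feeling hollow. Is this the alienation that work brings? Can I really not define myself without a position or an organization? Or is it that work defines a person’s social attributes, his role in society? I work at such-and-such company, so I should be like this or that... I hold an engineer’s job, so I should act, dress, and treat people in a certain way... and just like that, one is defined inside a frame. In fact the story behind every person is rich and unique; how can a few frames define it? Everyone should find what is remarkable in themselves and in others through that uniqueness. With no organization to belong to, and having left colleagues I used to see every day, is there no sense of belonging left? Go back to the family! Colleagues, work, and professional networks can define most of a person, but only family members do not care about these work-related social attributes; with them you go back to being a husband, a father, a son. These identities matter a great deal, because they are the roles that stay with you for life.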

Keynote panel at RISC-V Summit 2018: opportunities and challenges in security for open source hardware
- Complex systems tend to have bugs, so making them proprietary will make them more secure from attacks.
- But simple open source systems attract more eyes to review them, so they become more and more secure over time.
- Military and government customers want secret projects.
- We need to change the way we design systems

From reading the paper “The Hwacha Microarchitecture Manual, Version 3.8.1”, I found that our Pygmy ES1 architecture follows almost the same idea, just not as fancy.
- We don’t have cache coherency, because we operate on a unified physical memory space
- The vector execution unit is not as fancy, just useless multithreading, no systolic
- The prefetch is supposed to be done by a DMA engine that is manually controlled by software
System architecture
- The vector accelerator only has L1 I$, no D$
  - Don’t need to maintain cache coherency
- Lots of vector registers, 512 in total, each is 64x2x4=512-bit
- Wide bus connection to L2$, to provide higher bandwidth
- Uncached TileLink between L1 I$ (both scalar processor and vector processors) and L2$
- Cached TileLink between L1 D$ (only in scalar processor) and L2$
- L2$ maintains directory bits, which determine the state of the corresponding cache line (JW: maybe something like MESI bits)
Operations of L2$, supported by TileLink protocol (“Productive Design of Extensible On-Chip Memory Hierarchies”)
- Sub-cache-block accesses
- Data prefetch requests
  - Read from DDR, don’t need to send back to the requester
- Atomic memory operations
  - ALU inside L2 cache banks
Decoupling
- Access/execute decoupling
  - Decoupled vector arch
- Cache refill/access decoupling
Vector Command Queue (VCMDQ)
- Instruction fetch is handled by the scalar processor, and the instructions are then sent to the VCMDQ
- There are explicitly defined start (vf) and stop (vstop) instructions that flag the beginning and end of vector instructions
  - JW: why is that necessary?
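The decoupling idea behind a command queue can be sketched with a toy cycle-level model. This is not Hwacha’s actual timing or interface: the `simulate` function and its parameters (`queue_depth`, `vector_latency`) are hypothetical, and the model only assumes that a scalar front end pushes one vector command per cycle into a bounded queue while a slower vector unit drains it.

```python
from collections import deque


def simulate(num_commands, queue_depth, vector_latency):
    """Toy model of a scalar front end feeding a vector unit through a queue.

    Returns (total_cycles, scalar_stall_cycles). The scalar side stalls only
    when the queue is full; the vector unit takes `vector_latency` cycles
    per command.
    """
    queue = deque()
    issued = retired = stalls = cycle = 0
    busy = 0  # cycles left on the command the vector unit is executing

    while retired < num_commands:
        cycle += 1

        # Vector unit: start the next queued command if idle, then do one cycle of work.
        if busy == 0 and queue:
            queue.popleft()
            busy = vector_latency
        if busy > 0:
            busy -= 1
            if busy == 0:
                retired += 1

        # Scalar front end: enqueue one command per cycle if there is room.
        if issued < num_commands:
            if len(queue) < queue_depth:
                queue.append(issued)
                issued += 1
            else:
                stalls += 1

    return cycle, stalls


shallow = simulate(num_commands=4, queue_depth=1, vector_latency=2)
deep = simulate(num_commands=4, queue_depth=4, vector_latency=2)
print("depth 1:", shallow)
print("depth 4:", deep)
```

In this model the total run time is set by the vector unit either way, but with the deeper queue the scalar side never stalls, so it finishes issuing earlier and is free to run ahead on other work.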

Work experience

ARM's big.LITTLE Architecture

Re-discover Hardware Security in Modern SoC

On Time Is Professional

HBR's 10 Must Reads "Managing Oneself" Notes

An Anecdote from a Chinese Tech Company

Working with Device Tree (DOULOS)

Static Site Generator

Give notice to OURS

Hardware Security

Hwacha Accelerator architecture